How VTA Combines Large‑Model Reasoning for Precise and Explainable Stock Time‑Series Forecasting
The VTA framework integrates large language model reasoning with textual annotation of technical indicators, employs a Time‑GRPO reinforcement‑learning objective and multi‑stage joint conditional training, and achieves state‑of‑the‑art accuracy and expert‑rated interpretability on US, Chinese and European stock datasets.
Background
Large language models (LLMs) have been applied to financial analysis tasks such as question answering and sentiment analysis, but most approaches focus on textual reports and ignore historical price data (technical analysis). This gap motivates a study of how to use LLMs for language‑based reasoning over stock price time series while preserving interpretability.
Challenges
Current LLMs have limited native time‑series prediction capability, and modifying their embedding spaces to add it harms natural‑language interpretability.
No gold‑standard supervision exists for language reasoning over time‑series signals.
The token‑prediction paradigm of LLMs is difficult to convert into accurate numeric time‑series outputs.
Problem Definition
The goal is to enable LLMs to perform language reasoning on financial time‑series inputs and generate accurate, explainable stock forecasts. Specifically, the sub‑problems are:
How to let LLMs reason over time‑series inputs to produce interpretable predictions?
How to convert the LLM reasoning process into accurate numeric forecasts?
How to maintain both prediction accuracy and model interpretability?
Method: Verbal Technical Analysis (VTA)
VTA consists of three components: time‑series reasoning, time‑series prediction, and joint conditional training.
3.1 Problem Formulation
Given a historical window of length T, the input sequence is X = {x_{t‑T+1}, …, x_t}, where each x_t = [o_t, h_t, l_t, v_t, c_t, p_t] collects the open, high, low, volume, close, and adjusted‑close values. The model must produce a language reasoning trace v and a future price trajectory y = {p_{t+1}, …, p_{t+T'}}.
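As a concrete sketch of this windowing (the paper gives no code; array layout and names are illustrative):

```python
import numpy as np

def make_window(prices: np.ndarray, t: int, T: int, T_out: int):
    """Slice a length-T input window X and a length-T' target y from a
    (days x 6) array with columns [o, h, l, v, c, p] as in the paper.
    The adjusted close p (column 5) forms the forecast target."""
    X = prices[t - T + 1 : t + 1]          # input window, shape (T, 6)
    y = prices[t + 1 : t + 1 + T_out, 5]   # future adjusted-close trajectory, shape (T',)
    return X, y
```

With the paper's setting T = T' = 10, each training example pairs a ten‑day window with the next ten adjusted‑close values.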
3.2 Time‑Series Reasoning
The process converts raw price data into textual annotations and trains the LLM to predict future series conditioned on these annotations.
Text annotation: A function f maps X to X' = f(X), embedding statistical summaries (mean, min, max) and technical indicators (moving averages, momentum, etc.).
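A minimal, hypothetical version of the annotation function f, assuming a simple moving average and momentum as the indicators (the paper's exact indicator set and phrasing are not specified):

```python
import numpy as np

def annotate(close: np.ndarray, window: int = 5) -> str:
    """Render statistical summaries and simple technical indicators of a
    close-price series as text, in the spirit of the annotation function f.
    Wording and indicator choice are illustrative assumptions."""
    sma = close[-window:].mean()               # simple moving average
    momentum = close[-1] - close[-window]      # change over the window
    return (
        f"mean={close.mean():.2f}, min={close.min():.2f}, max={close.max():.2f}; "
        f"SMA{window}={sma:.2f}; {window}-day momentum={momentum:+.2f}"
    )
```

The resulting string X' is concatenated with the raw series into the prompt q.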
Training objective: Prompt q combines X and X'. The LLM outputs o, containing a predicted series y_θ and a reasoning trace v_θ. Optimization uses the Time‑Series Group Relative Policy Optimization (Time‑GRPO) objective, which incorporates an inverse‑MSE reward r_{MSE} and an additional MSE‑based reward weighted by λ.
Training pipeline: Multi‑stage fine‑tuning.
Cold‑start stage generates initial samples guided by L_{time‑grpo} because no gold‑standard supervision exists.
Effective‑reasoning stage applies rejection sampling, retaining only samples whose MSE falls in the lowest 10% per stock and time bucket, then performs supervised fine‑tuning (SFT) on this filtered set.
Best‑prediction stage continues to optimize L_{time‑grpo} to maximize expected forecasting accuracy.
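The inverse‑MSE reward and the 10% rejection‑sampling filter can be sketched as follows; the λ‑weighted extra reward term and the per‑stock/per‑bucket grouping are omitted for brevity, and all names are illustrative:

```python
import numpy as np

def inverse_mse_reward(y_pred, y_true, eps: float = 1e-6) -> float:
    """Reward in the spirit of Time-GRPO's r_MSE: higher when the
    predicted series is closer to the realized one. The exact
    normalization used in the paper is not reproduced here."""
    mse = float(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))
    return 1.0 / (mse + eps)

def rejection_sample(samples, keep_frac: float = 0.10):
    """Keep the samples whose MSE falls in the lowest 10% (a single flat
    pool here; the paper filters per stock and time bucket).
    Each sample is an (mse, payload) pair."""
    samples = sorted(samples, key=lambda s: s[0])
    k = max(1, int(len(samples) * keep_frac))
    return samples[:k]
```

The retained low‑MSE samples then serve as the SFT set for the effective‑reasoning stage.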
3.3 Time‑Series Prediction
A transformer‑based time‑series model aligns the time‑series distribution with the language distribution.
Embedding layer and multi‑head attention project X to a time token X_{time}.
PCA extracts principal word embeddings D from the LLM’s embedding space.
Multi‑head cross‑attention aligns X_{time} with D, producing cross‑modal tokens.
Subsequent LLM transformer blocks process the combined tokens, with feature‑regularization loss matching time and text branches.
A final dense layer outputs the time‑aligned prediction y_{time}, which becomes the forecast y_φ(X).
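A simplified stand‑in for the alignment steps above, assuming plain SVD‑based PCA and single‑head attention (the paper uses multi‑head cross‑attention inside transformer blocks, and does not specify the PCA implementation):

```python
import numpy as np

def principal_word_embeddings(E: np.ndarray, k: int) -> np.ndarray:
    """Extract k principal directions D from an LLM embedding matrix E
    (vocab x dim) via PCA, here implemented with a plain SVD."""
    Ec = E - E.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Ec, full_matrices=False)
    return Vt[:k]                               # (k, dim) principal word embeddings

def cross_attend(X_time: np.ndarray, D: np.ndarray) -> np.ndarray:
    """Single-head stand-in for the cross-attention that aligns time
    tokens X_time (n x dim) with principal embeddings D (k x dim)."""
    scores = X_time @ D.T / np.sqrt(D.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)           # softmax over the k directions
    return w @ D                                # cross-modal tokens, (n, dim)
```

The idea is that each time token becomes a convex combination of principal word embeddings, placing it inside the language model's representation space.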
3.4 Joint Conditional Training
To preserve interpretability while improving accuracy, the time‑series predictor is conditioned on the reasoning output.
Extract reasoning output o from the fine‑tuned inference policy π_θ.
Derive descriptive attribute classes c (e.g., max, min, mean) from the generated series.
Concatenate c with the predictor’s output and pass through a linear layer, then aggregate via a projection layer to obtain the conditional prediction y_ψ(X, c).
Train both conditional and unconditional branches jointly, using a probability p_{uncond}=0.3 to replace c with an unconditional token, and optimize with MSE loss.
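The unconditional‑replacement step in the joint training loop can be sketched as follows; the token name is a hypothetical placeholder:

```python
import random

UNCOND_TOKEN = "<uncond>"  # hypothetical placeholder for the unconditional token

def maybe_drop_condition(c, p_uncond: float = 0.3, rng=random):
    """With probability p_uncond, replace the reasoning-derived attribute
    classes c with an unconditional token, so the predictor learns both
    conditional and unconditional branches (classifier-free-guidance-style
    training)."""
    return UNCOND_TOKEN if rng.random() < p_uncond else c
```

Over a training epoch, roughly 30% of examples are therefore trained unconditionally, the rest conditioned on c.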
Experiments
Datasets
ACL18 StockNet: 88 US stocks (top‑10 market‑cap per sector) from 2012‑09‑01 to 2017‑09‑01.
Additional data from US, Chinese, and European indices (2024‑01‑01 to 2025‑01‑01).
Baselines
Traditional time‑series models: Transformer, Reformer, Informer, Autoformer, DLinear, FiLM, Crossformer, MICN, LightTS, TimesNet, TSMixer, Non‑Stationary Transformer.
LLM‑based time‑series models: TimeLLM, CALF.
Explainable LLMs: GPT‑4.1 mini, DeepSeek‑R1.
Implementation Details
All LLMs (inference and predictor) are fine‑tuned with LoRA.
Input and output lengths T = T' = 10 (short‑term forecasting).
Inference model: Qwen2.5‑7B‑Instruct; predictor model: GPT‑2.
Hyper‑parameters: p_{uncond}=0.3, guidance scale s=0.1.
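The paper reports a guidance scale s = 0.1 but not the exact combination rule; assuming the classifier‑free‑guidance form that the p_{uncond} training suggests, inference would blend the two branches as:

```python
import numpy as np

def guided_prediction(y_cond: np.ndarray, y_uncond: np.ndarray,
                      s: float = 0.1) -> np.ndarray:
    """Blend conditional and unconditional forecasts with guidance scale s.
    The combination rule y_uncond + s * (y_cond - y_uncond) is an assumed
    classifier-free-guidance form, not confirmed by the paper."""
    return y_uncond + s * (y_cond - y_uncond)
```

With s = 0.1 this form nudges the unconditional forecast toward the reasoning‑conditioned one rather than replacing it outright.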
Results
Inference LLMs: GPT‑4.1 mini and DeepSeek‑R1, though not fine‑tuned for forecasting, outperform several non‑LLM baselines, showing the value of language reasoning over price data.
Time‑series baselines: Models that decompose trend and seasonality (FiLM, MICN, Autoformer, TimesNet) achieve large gains, reflecting the non‑stationary nature of stock prices.
LLM‑based time‑series models: TimeLLM and CALF surpass non‑LLM baselines, likely due to LLMs' embedded financial knowledge.
VTA: Achieves the lowest MSE and MAE, combining latent (internal) understanding with explicit (language) reasoning, and, unlike most baselines, provides interpretable reasoning traces.
Ablation Studies
Inverse MSE reward: The reward r_{MSE} grows with training steps, indicating that the model is learning language reasoning steps useful for time‑series prediction.
Fine‑tuning stages: The first RL stage (Time‑GRPO) supplies data for the later stages. After rejection sampling and SFT, the second RL stage improves performance by ~20.3% on average, demonstrating the benefit of the multi‑stage pipeline.
Conditional training: Adding conditional training on the extra predictor further boosts performance, confirming the effectiveness of jointly conditioning on reasoning outputs.
Inference Quality Evaluation
Following prior work on LLM explainability, five metrics (clarity, depth, accuracy, coherence, relevance) were assessed by 25 finance‑industry experts. VTA received the highest average scores on all metrics, with notable gains in depth, accuracy, and relevance.
Portfolio Optimization
Using a 10‑day Markowitz optimization, portfolios built on VTA forecasts were compared against those built on other models across return, volatility, max drawdown, and Sharpe ratio. VTA‑based portfolios achieved superior returns, lower volatility, smaller drawdowns, and the highest Sharpe ratio among all methods.
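A toy mean‑variance (Markowitz) allocation over forecast returns, for illustration only; the paper's exact constraints and solver are not specified, and this sketch uses an unconstrained closed form with long‑only clipping instead of a proper QP:

```python
import numpy as np

def markowitz_weights(forecast_returns: np.ndarray,
                      risk_aversion: float = 1.0) -> np.ndarray:
    """Given a (days x assets) matrix of forecast returns (e.g. a 10-day
    horizon), approximately maximize mu.w - (risk_aversion/2) * w' Sigma w
    with weights summing to 1, made long-only by clipping negatives."""
    mu = forecast_returns.mean(axis=0)
    sigma = np.cov(forecast_returns, rowvar=False) + 1e-6 * np.eye(len(mu))
    w = np.linalg.solve(risk_aversion * sigma, mu)  # unconstrained optimum
    w = np.clip(w, 0.0, None)                       # long-only
    return w / w.sum() if w.sum() > 0 else np.full_like(mu, 1.0 / len(mu))
```

Feeding each model's 10‑day forecasts through the same allocator isolates the forecast quality, which is what the return, volatility, drawdown, and Sharpe comparisons measure.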