Paper Review: RETuning Boosts Large‑Model Stock Trend Prediction Reasoning
This article analyzes the RETuning framework, which tackles two weaknesses of LLMs in stock movement prediction: bias toward analyst opinions in the context and a failure to weigh opposing evidence. RETuning introduces a two‑stage pipeline of cold‑start fine‑tuning followed by reinforcement learning, is evaluated on the large Fin‑2024 dataset, and demonstrates significant F1 gains, inference‑time scaling benefits, and out‑of‑distribution robustness.
Background
Large language models (LLMs) have shown strong reasoning abilities in mathematics and code generation, yet their potential for financial tasks such as three‑class stock movement prediction (up/hold/down) remains under‑explored. Existing financial datasets are outdated or limited, and LLMs suffer from two main issues: (1) context bias toward analyst viewpoints, and (2) neglect of opposing evidence, which hampers robust decision‑making.
Problem Definition
The authors identify three core challenges for LLM‑based stock prediction: dependence on opinions present in the context, ignoring counter‑evidence, and under‑utilized reasoning capability.
Method – RETuning
RETuning is a two‑stage framework consisting of a cold‑start supervised fine‑tuning (SFT) phase and a reinforcement learning phase based on GRPO (Group Relative Policy Optimization).
Cold‑start SFT treats stock prediction as a generative reasoning task. It guides LLMs through:
Task Understanding: predict the price change between the previous close and the next open, using a ±3% threshold for classification.
Analysis Framework Construction: dynamically build a multi‑dimensional framework (fundamentals, news trends, macro signals, etc.) independent of analyst comments.
Evidence Extraction & Scoring: extract supporting evidence from multiple sources and assign a 0–10 score (e.g., policy support +9, fundamental loss +7).
Reflection & Reconciliation: apply hypothesis testing and market simulation to reconcile conflicting evidence and adjust scores.
Structured Output: generate a response containing the reasoning process, evidence scores (<score>[a,b]</score>), predicted percentage change (<pct_change>0.xxxx</pct_change>), and direction tag (<up/down/hold>); a parsing sketch follows this list.
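To make the output format and the ±3% rule concrete, here is a minimal parsing sketch. The tag names come from the paper's output format; the helper function itself and its exact field handling are my own assumption, not the authors' code.

```python
import re

def parse_prediction(response: str, threshold: float = 0.03):
    """Extract the predicted percentage change and map it to up/hold/down via a ±3% threshold."""
    m = re.search(r"<pct_change>(-?\d+\.\d+)</pct_change>", response)
    if m is None:
        return None  # malformed output; a format reward would score this as zero
    pct_change = float(m.group(1))
    if pct_change > threshold:
        direction = "up"
    elif pct_change < -threshold:
        direction = "down"
    else:
        direction = "hold"
    return {"pct_change": pct_change, "direction": direction}

# Example: a +4.2% predicted change between previous close and next open maps to "up".
print(parse_prediction("<pct_change>0.0420</pct_change>"))
```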
The SFT dataset is built via a semi‑automatic pipeline: 300 samples are generated with the 671B DeepSeek‑R1 model and filtered through format validation, prediction‑consistency checks, and manual review, yielding 188 high‑quality cold‑start samples; these are merged with 10K generic reasoning samples to avoid catastrophic forgetting.
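The automatic portion of that filtering could look roughly like the sketch below: keep only samples whose output is well‑formed and whose predicted change agrees with both the direction tag and the realized label. The function name, sample fields, and exact checks are assumptions, not the authors' pipeline.

```python
import re

def passes_filters(sample: dict, threshold: float = 0.03) -> bool:
    response = sample["response"]
    pct = re.search(r"<pct_change>(-?\d+\.\d+)</pct_change>", response)
    tag = re.search(r"<(up|down|hold)>", response)
    if pct is None or tag is None:                 # format validation
        return False
    change = float(pct.group(1))
    implied = "up" if change > threshold else "down" if change < -threshold else "hold"
    if implied != tag.group(1):                    # internal consistency between change and tag
        return False
    return tag.group(1) == sample["label"]         # prediction matches the realized move

# Surviving samples would still go through manual review before being merged
# with the generic reasoning data.
raw_samples = [{"response": "<pct_change>0.0500</pct_change> <up>", "label": "up"}]
cold_start = [s for s in raw_samples if passes_filters(s)]
```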
Reinforcement Learning – GRPO further aligns model behavior through a reward function that combines format correctness, directional accuracy, and consistency between the predicted change and the predicted direction, weighted by hyper‑parameters α, β, and γ. Curriculum learning selects medium‑difficulty samples based on prediction‑error frequency, focusing training on informative signals. Inference‑time scaling is achieved by repeated sampling at temperature 0.6 (n = 1, 2, 4, 8, 16, 32) with majority voting over the predicted direction.
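A minimal sketch of how such a reward and the majority‑vote decoding might look: the α/β/γ weights are from the paper, but the 0/1 decomposition, default values, and function signatures below are my own simplification.

```python
from collections import Counter

def reward(format_ok: bool, pred_dir: str, true_dir: str, pct_change: float,
           alpha: float = 1.0, beta: float = 1.0, gamma: float = 0.5,
           threshold: float = 0.03) -> float:
    """Weighted sum of format, directional accuracy, and change/direction consistency terms."""
    r_format = 1.0 if format_ok else 0.0
    r_acc = 1.0 if pred_dir == true_dir else 0.0
    implied = "up" if pct_change > threshold else "down" if pct_change < -threshold else "hold"
    r_consistent = 1.0 if implied == pred_dir else 0.0
    return alpha * r_format + beta * r_acc + gamma * r_consistent

def majority_vote(directions: list[str]) -> str:
    """Inference-time scaling: pick the most common direction among n sampled completions."""
    return Counter(directions).most_common(1)[0][0]

print(reward(True, "up", "up", 0.045))          # all three terms satisfied
print(majority_vote(["up", "hold", "up", "up"]))  # -> "up"
```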
Dataset – Fin‑2024
The dataset covers 5,123 A‑shares in 2024, providing 209,063 samples with 32K‑token contexts. It integrates price, news, analyst opinions, quantitative reports, fundamentals, macro indicators, and similar‑stock information. Training data span January–November, test data come from December, and a long‑term evaluation set uses June 2025 data.
Experimental Setup
Baselines: LLMFactor, Fino1, Fin‑R1, StockNet, and mainstream LLMs (DeepSeek, Qwen‑3, GPT‑OSS).
Metrics: three‑class F1 (balanced), inference‑time scaling effect (varying n), and out‑of‑distribution (OOD) generalization on stock, date, and combined shifts.
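For reference, a balanced three‑class F1 can be computed as a macro average over up/hold/down; whether the paper uses macro or another averaging scheme is an assumption on my part.

```python
from sklearn.metrics import f1_score

# Toy example: macro-averaged F1 over the three movement classes.
y_true = ["up", "hold", "down", "up", "hold"]
y_pred = ["up", "down", "down", "hold", "hold"]
print(f1_score(y_true, y_pred, average="macro", labels=["up", "hold", "down"]))
```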
Results
On Fin‑2024[December], DeepSeek_R1_14B_SFT_GRPO achieves F1 = 0.4196, a 20.75% improvement over the baseline and 22.15% over the best public model (GPT‑OSS‑120B + CoT). The 32B variant reaches F1 = 0.4071 (+14.13%).
Inference‑time scaling: increasing n to 32 raises F1 from 0.3475 to 0.4196 for the 14B model; the improvement persists six months later on Fin‑2025[June].
OOD performance: on unseen stocks, the 32B_SFT_GRPO (n = 32) attains F1 = 0.45, outperforming the 14B model by 0.07; on unseen dates, the peak F1 is 0.42, indicating temporal shift difficulty; on combined stock‑date OOD, F1 reaches 0.50, showing synergy between model size and scaling.
Cross‑task generalization on BizFinBench: RETuning raises average scores from 59.49 to 66.92 (14B) and from 66.29 to 70.44 (32B), entering the top‑3 on tasks such as Financial Numerical Computation and Financial Tool Use.
Ablation: pure CoT prompts benefit only large models (Qwen‑3‑32B, GPT‑OSS‑120B); RETuning’s SFT + GRPO consistently outperforms CoT alone.
Overall, RETuning demonstrates that a reflective evidence‑driven fine‑tuning stage followed by a rule‑based reinforcement learning stage can unlock LLMs’ latent reasoning power for financial forecasting, improve inference‑time scaling, and enhance robustness to distribution shifts.