Trading-R1: Open-Source LLM Framework for Explainable Financial Trading

This article reviews Trading‑R1, an open‑source LLM trading framework that integrates multimodal financial data and trains the model with supervised fine‑tuning followed by a three‑stage reinforcement‑learning curriculum to generate structured investment arguments and risk‑adjusted trade decisions, achieving superior Sharpe‑ratio and maximum‑drawdown performance in real‑world stock and ETF tests.

Bighead's Algorithm Notes

Background

Financial trading demands high explainability and reliability. Traditional time‑series models rely on handcrafted features and lack logical explanations, while generic large language models (LLMs) struggle to convert natural‑language analysis into executable trading decisions.

Challenges

Data integration difficulty: multimodal financial data (news, fundamentals, technical indicators) are noisy and hard for LLMs to fuse.

Decision reliability: hallucination leads to reasoning that deviates from real data.

Sparse training data: public financial data are fragmented and lack structured supervision signals.

Task mismatch: trading is a path‑dependent, uncertain process unlike the mathematical or programming tasks LLMs are usually optimized for.

Problem Definition

The paper asks three questions: how to enable LLMs to generate professional, structured investment arguments that incorporate technical analysis, fundamentals, news, and other evidence; how to translate LLM reasoning into executable trading decisions that adapt to market risk; and how to train LLMs for multi‑stage reasoning on limited high‑quality financial data.

Method

Data Construction (Tauric‑TR1‑DB)

The authors build a dataset of 100k samples covering 14 large‑cap stocks (e.g., NVDA, AAPL) and 2 ETFs (SPY, QQQ) over 18 months (2024‑01 to 2025‑05). Each sample contains five modalities: technical indicators, fundamentals, news, sentiment, and macroeconomic data. Samples are created by time‑bucket splitting (3 days, 4‑10 days, 11‑30 days), noise filtering via LLM relevance scoring, and random multimodal fusion, yielding inputs of 20‑30k tokens paired with a supervision signal consisting of an investment argument and a trade label.
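The sample-assembly pipeline described above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the bucket boundaries follow the text (the first bucket is assumed to start at day 1), and `relevance_score` stands in for the LLM-based relevance scoring step.

```python
import random

# Horizon buckets paraphrased from the text; the lower bound of the
# first bucket is an assumption.
TIME_BUCKETS = [(1, 3), (4, 10), (11, 30)]

def build_sample(records, relevance_score, threshold=0.5, rng=random):
    """Assemble one multimodal training sample.

    records: list of (modality, text) tuples, e.g. ("news", "...").
    relevance_score: callable scoring each record in [0, 1]
                     (the paper uses an LLM for this step).
    """
    lo, hi = rng.choice(TIME_BUCKETS)          # pick a horizon bucket
    horizon = rng.randint(lo, hi)
    # Noise filtering: keep only records the scorer judges relevant.
    kept = [(m, t) for m, t in records if relevance_score(t) >= threshold]
    rng.shuffle(kept)                          # random multimodal fusion
    prompt = "\n\n".join(f"[{m.upper()}]\n{t}" for m, t in kept)
    return {"horizon_days": horizon, "input": prompt}
```

In practice the kept records would be concatenated up to the 20-30k-token budget mentioned above before being paired with the argument and trade label.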

Data construction diagram

Supervised Fine‑Tuning (SFT) – Reverse‑Reasoning Distillation

Proprietary LLMs (e.g., OpenAI o3‑mini) generate final trade suggestions from structured inputs. A lightweight LLM (e.g., GPT‑4.1‑nano) parses these suggestions to infer intermediate reasoning steps (technical contribution, fundamental contribution, etc.). The inferred steps are concatenated with multimodal evidence to form a structured investment argument, which becomes the supervision target for SFT.
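The distillation flow above can be outlined as a small pipeline. This is a hedged sketch of the idea only; `call_llm` is a hypothetical hook for whatever client serves the teacher and parser models, and the prompts are illustrative.

```python
def distill_example(market_context, call_llm):
    """Reverse-reasoning distillation: decision first, reasoning inferred after.

    call_llm: hypothetical callable (model_name, prompt) -> text.
    """
    # 1. A strong proprietary model proposes the final trade suggestion.
    suggestion = call_llm(
        "teacher", f"Given:\n{market_context}\nSuggest a trade.")
    # 2. A lightweight model infers the intermediate reasoning backwards
    #    (technical contribution, fundamental contribution, etc.).
    steps = call_llm(
        "parser",
        f"Suggestion: {suggestion}\n"
        "List the technical, fundamental and news contributions "
        "that would justify it.")
    # 3. Inferred steps plus the multimodal evidence form the SFT target.
    return {"input": market_context,
            "target": f"{steps}\n\nDecision: {suggestion}"}
```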

SFT diagram

Reinforcement Learning (RL) – Three‑Stage Curriculum

The model is trained sequentially through three stages:

Stage I (Structure): Reward the model for producing investment arguments with a standardized XML‑like layout (technical, fundamental, news sections).

Stage II (Argument): Reward the inclusion of direct citations from the input data (e.g., “revenue growth 15 %” must cite the earnings report) to reduce hallucination.

Stage III (Decision): Use volatility‑adjusted labels (strong buy, buy, hold, sell, strong sell) derived from multi‑window EMA returns and standardized volatility to reward decisions that align with market performance.
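The three reward stages might be sketched as follows. Section names, weights, and the exact reward shapes here are illustrative assumptions, not the paper's definitions; only the intent of each stage comes from the text.

```python
import re

# Required sections for the XML-like layout (Stage I); names assumed.
REQUIRED_SECTIONS = ["technical", "fundamental", "news"]

def structure_reward(output):
    """Stage I: fraction of required sections present in the layout."""
    return sum(
        1.0 for s in REQUIRED_SECTIONS
        if re.search(f"<{s}>.*?</{s}>", output, re.DOTALL)
    ) / len(REQUIRED_SECTIONS)

def evidence_reward(output, source_facts):
    """Stage II: fraction of input facts directly cited (anti-hallucination)."""
    if not source_facts:
        return 0.0
    cited = sum(1.0 for fact in source_facts if fact in output)
    return cited / len(source_facts)

def decision_reward(predicted_label, market_label):
    """Stage III: agreement with the volatility-adjusted market label."""
    return 1.0 if predicted_label == market_label else 0.0
```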

Three‑stage learning diagram

Key Techniques

Volatility‑adjusted label generation computes EMA returns over 3‑, 7‑, and 15‑day windows, normalizes them with rolling volatility, and maps them to discrete labels using quantile thresholds (85 %, 53 %, 15 %, 3 %). Strategy optimization employs Group Relative Policy Optimization (GRPO), which computes advantages group‑wise over sampled responses instead of learning a separate value function; the objective also includes a KL‑penalty term weighted by β.
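The labelling scheme can be sketched as below. The window sizes and quantile cut-offs follow the text; the specific EMA smoothing, the averaging across horizons, and the rolling-volatility window are assumptions.

```python
import numpy as np

LABELS = ["strong sell", "sell", "hold", "buy", "strong buy"]

def ema(x, span):
    """Exponential moving average with the usual 2/(span+1) smoothing."""
    alpha = 2.0 / (span + 1.0)
    out = np.empty(len(x), dtype=float)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = alpha * x[i] + (1 - alpha) * out[i - 1]
    return out

def vol_adjusted_labels(returns, windows=(3, 7, 15), vol_win=15):
    r = np.asarray(returns, dtype=float)
    # Average the EMA return signal across the three horizons.
    signal = np.mean([ema(r, w) for w in windows], axis=0)
    # Normalize by rolling volatility so calm and wild regimes compare.
    vol = np.array([r[max(0, i - vol_win):i + 1].std() or 1.0
                    for i in range(len(r))])
    score = signal / vol
    # Quantile thresholds from the text: 3% / 15% / 53% / 85%.
    q = np.quantile(score, [0.03, 0.15, 0.53, 0.85])
    return [LABELS[int(np.searchsorted(q, s))] for s in score]
```

By construction, roughly 38 % of days land in "hold" (between the 15 % and 53 % quantiles) and only the extreme tails receive the "strong" labels.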

GRPO objective diagram
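For reference, the standard GRPO objective with clipping and a KL penalty can be written as follows (usual GRPO notation; the paper's exact symbols may differ):

```latex
J(\theta) = \mathbb{E}\!\left[
  \frac{1}{G}\sum_{i=1}^{G}
  \min\!\Big(
    \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\, A_i,\;
    \operatorname{clip}\!\Big(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon\Big) A_i
  \Big)
\right]
- \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big),
\qquad
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}
```

The group-wise advantage \(A_i\) normalizes each sampled response's reward against its group, which is what removes the need for a learned value function.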

Experiments

Setup

Training uses the 100 k‑sample Tauric‑TR1‑DB. Testing covers six assets (AAPL, NVDA, AMZN, etc.) and two ETFs (SPY, QQQ) from 2024‑06 to 2024‑08. Baselines include small models (Qwen‑4B, GPT‑4.1‑nano), large models (GPT‑4.1, LLaMA‑3.3), and RL models (DeepSeek, O3‑mini). Evaluation metrics are cumulative return (CR), Sharpe ratio (SR), hit rate (HR), and maximum drawdown (MDD).
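The four evaluation metrics can be computed as below. This is a minimal sketch: the annualization factor and the sign-agreement definition of hit rate are common conventions assumed here, not formulas taken from the paper.

```python
import numpy as np

def cumulative_return(returns):
    """CR: compounded return over the test period."""
    return float(np.prod(1 + np.asarray(returns, dtype=float)) - 1)

def sharpe_ratio(returns, periods_per_year=252):
    """SR: annualized mean/std of per-period returns (zero risk-free rate)."""
    r = np.asarray(returns, dtype=float)
    if r.std() == 0:
        return 0.0
    return float(r.mean() / r.std() * np.sqrt(periods_per_year))

def hit_rate(positions, returns):
    """HR: fraction of periods where position sign matches return sign."""
    p, r = np.asarray(positions), np.asarray(returns)
    return float(np.mean(np.sign(p) == np.sign(r)))

def max_drawdown(returns):
    """MDD: largest peak-to-trough decline of the equity curve."""
    equity = np.cumprod(1 + np.asarray(returns, dtype=float))
    peak = np.maximum.accumulate(equity)
    return float(np.max((peak - equity) / peak))
```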

Overall Performance

Trading‑R1 achieves the highest Sharpe ratio and the lowest maximum drawdown across all assets, outperforming both open‑source and proprietary instruction‑following or inference‑only models.

Performance chart

Ablation Study

Removing SFT yields HR = 62.5 % (NVDA) and MDD = 2.73 %; removing RL yields HR = 45.7 % (NVDA) and MDD = 1.66 %; the full Trading‑R1 obtains HR = 70.0 % with MDD = 3.80 %, demonstrating the complementary benefit of structure and decision learning.

Ablation results

Interpretability

The generated arguments cite concrete evidence such as MACD crossovers, a gross margin of 68.7 %, or Azure cloud growth, providing traceable reasoning for each trade decision.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: LLM, multimodal, reinforcement learning, explainability, dataset, financial trading, Trading-R1
Written by

Bighead's Algorithm Notes

Focused on AI applications in the fintech sector
