Paper Review: TradingGroup – A Multi‑Agent Quantitative Trading System with Self‑Reflection and Data Synthesis
The paper introduces TradingGroup, a five‑agent LLM‑based quantitative trading framework that incorporates a self‑reflection mechanism, dynamic risk management, and an automated data‑synthesis pipeline, and demonstrates superior cumulative returns, Sharpe ratios, and lower drawdowns than rule‑based, ML, RL, and existing LLM strategies on five real‑world stock datasets.
Background
Recent advances in large language models (LLMs) have enabled financial‑agent applications such as news‑sentiment analysis, earnings parsing, and stock‑price prediction. Existing systems are limited by weak multi‑agent collaboration, lack of a structured self‑reflection mechanism, and insufficient high‑quality post‑training data that capture market states together with agent decisions.
Problem Definition
The work addresses two core challenges for quantitative‑trading agents:
Self‑reflection and strategy optimization: design a mechanism that aggregates logs, historical performance, and risk signals from multiple agents, extracts good and bad decision patterns, and dynamically optimizes each agent’s workflow.
High‑quality post‑training data generation: build an automated data‑synthesis and annotation pipeline that records market state, agent decisions, and execution results to provide rich samples for LLM fine‑tuning.
Method
TradingGroup consists of five specialized agents, a dynamic risk‑management module, a self‑reflection component, and a data‑synthesis pipeline.
Agent modules
3.1.1 News Sentiment Agent retrieves real‑time news via MCP tools, scores impact with Qwen3‑Reranker‑0.6B, removes duplicates using Qwen3‑Embedding‑0.6B, and outputs a market‑sentiment score.
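The embedding‑based deduplication step can be sketched as a greedy cosine‑similarity filter. This is an illustrative reconstruction (the paper does not give the algorithm or threshold); it assumes the Qwen3‑Embedding‑0.6B vectors have already been computed, and the `deduplicate` helper and its 0.9 threshold are hypothetical.

```python
import numpy as np

def deduplicate(embeddings: np.ndarray, texts: list[str], threshold: float = 0.9) -> list[str]:
    """Greedy near-duplicate removal: keep an item only if its cosine
    similarity to every already-kept item stays below the threshold."""
    # Normalize rows so the dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    kept_idx: list[int] = []
    for i in range(len(texts)):
        if all(unit[i] @ unit[j] < threshold for j in kept_idx):
            kept_idx.append(i)
    return [texts[i] for i in kept_idx]
```

A greedy pass like this is O(n·k) in the number of kept items, which is adequate for a daily news batch.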
3.1.2 Earnings Parsing Agent extracts key financial metrics from quarterly/annual reports. After sliding‑window chunking, dense scores from Qwen3‑Embedding‑0.6B (weight w_d=1.0) and sparse scores from BGE‑M3‑0.6B (weight w_s=0.8) are combined to select the top‑10 chunks, which are then re‑ranked by Qwen3‑Reranker‑0.6B.
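The weighted fusion of dense and sparse retrieval scores described above can be sketched as follows. The weights w_d=1.0, w_s=0.8 and top‑10 cutoff are from the paper; the function name and the assumption that both score arrays are already aligned per chunk are mine.

```python
import numpy as np

def hybrid_top_k(dense: np.ndarray, sparse: np.ndarray,
                 w_d: float = 1.0, w_s: float = 0.8, k: int = 10) -> np.ndarray:
    """Combine per-chunk dense and sparse relevance scores with fixed
    weights and return the indices of the top-k chunks (best first)."""
    combined = w_d * dense + w_s * sparse
    order = np.argsort(combined)[::-1]  # descending by combined score
    return order[:k]
```

The selected chunks would then be passed to Qwen3‑Reranker‑0.6B for the final ordering.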
3.1.3 Stock Prediction Agent combines technical indicators (e.g., RSI‑14, 20‑day SMA deviation) with outputs from other agents. Key calculations include:
RSI‑14 > 70 → overbought; RSI‑14 < 40 → oversold.
3.1.4 Trade‑Style Adaptation Agent adjusts aggressiveness (aggressive / balanced / conservative) based on account status, historical performance, and multi‑agent analysis. The self‑reflection step analyses past style–performance pairs and selects the optimal style for the current capital and position.
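The RSI‑14 calculation and the overbought/oversold thresholds used by the Stock Prediction Agent can be sketched as below. This uses simple averages of gains and losses over the window (Wilder's smoothed variant differs slightly); only the >70 / <40 thresholds come from the paper.

```python
import numpy as np

def rsi(prices: np.ndarray, period: int = 14) -> float:
    """RSI over the last `period` price changes, using simple averages
    of gains and losses (Wilder's smoothed variant differs slightly)."""
    deltas = np.diff(prices)[-period:]
    gains = deltas[deltas > 0].sum()
    losses = -deltas[deltas < 0].sum()
    if losses == 0:
        return 100.0  # no losing days in the window
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)

def rsi_signal(value: float) -> str:
    # Thresholds as stated in the paper: >70 overbought, <40 oversold.
    if value > 70:
        return "overbought"
    if value < 40:
        return "oversold"
    return "neutral"
```

Note the paper's 40 oversold cutoff is stricter than the conventional 30, which biases the agent toward flagging weakness earlier.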
3.1.5 Decision Agent integrates predictions, sentiment, earnings insights, and style signals with account state to generate final buy/hold/sell actions. It annotates the past 20‑day outcomes, summarizes patterns, and injects them as LLM prompts to correct future errors.
Dynamic risk management
The module adjusts the stop‑loss threshold T_{SL} and the take‑profit threshold T_{TP} in real time. Thresholds are computed as
T_{SL} = m_s^{sl} \times \sigma_{d,10}, \quad T_{TP} = m_s^{tp} \times \sigma_{d,10},
where \sigma_{d,10} is the 10‑day unannualized standard deviation of log returns and m_s^{sl}, m_s^{tp} are style‑specific multipliers. Positions are forcibly closed when unrealized PnL ≤ −T_{SL} or ≥ T_{TP}, with position size scaled by the current trade style.
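The threshold computation above can be sketched directly from its definition. The multiplier values in `STYLE_MULTIPLIERS` are illustrative placeholders (the paper does not list them); only the 10‑day log‑return volatility and the forced‑close rule follow the text.

```python
import numpy as np

# Style-specific multipliers m_s^{sl}, m_s^{tp} -- illustrative values,
# not taken from the paper.
STYLE_MULTIPLIERS = {
    "aggressive":   {"sl": 2.5, "tp": 4.0},
    "balanced":     {"sl": 2.0, "tp": 3.0},
    "conservative": {"sl": 1.5, "tp": 2.0},
}

def risk_thresholds(prices: np.ndarray, style: str) -> tuple[float, float]:
    """T_SL and T_TP from the 10-day unannualized std of log returns."""
    log_ret = np.diff(np.log(prices[-11:]))   # last 10 daily log returns
    sigma = log_ret.std(ddof=1)               # sigma_{d,10}
    m = STYLE_MULTIPLIERS[style]
    return m["sl"] * sigma, m["tp"] * sigma

def should_close(unrealized_pnl: float, t_sl: float, t_tp: float) -> bool:
    # Force-close when PnL breaches either threshold.
    return unrealized_pnl <= -t_sl or unrealized_pnl >= t_tp
```

Tying the thresholds to recent realized volatility widens the band in choppy markets and tightens it in calm ones, which is what makes the stops "dynamic".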
Self‑reflection mechanism
Recent successful and failed cases are extracted from the data pipeline, patterns and root causes are summarized, and the insights are injected into LLM context to guide future predictions, style adaptation, and decision making.
Data synthesis pipeline
The pipeline records each agent’s input/output text, account metadata (date, holdings, cash), and the LLM chain‑of‑thought (CoT). It annotates predictions as correct/incorrect and computes rewards. For the stock‑prediction agent the reward is based on (pct, \varepsilon, p_{true}); for the decision agent the reward uses the realized return r_{eq,a}, the benchmark return r_{bm}, and the transaction cost c_a, with coefficients \beta=0.2 and \gamma=1.0. Only high‑reward samples are retained for post‑training.
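The record‑then‑filter step can be sketched as follows. The coefficients \beta=0.2 and \gamma=1.0 are from the paper, but the exact functional form of the decision‑agent reward is not reproduced there, so `decision_reward` below is one plausible combination of the listed terms (benchmark‑relative return minus a cost penalty), and the field names are mine.

```python
from dataclasses import dataclass

BETA, GAMMA = 0.2, 1.0  # coefficients reported in the paper

@dataclass
class TradeSample:
    prompt: str          # agent input text
    response: str        # agent output, incl. chain-of-thought
    r_eq: float          # realized equity return r_{eq,a} of the action
    r_bm: float          # benchmark return r_{bm} over the same window
    cost: float          # transaction cost c_a

def decision_reward(s: TradeSample) -> float:
    """One plausible combination of the listed terms (the paper's exact
    form is not given): benchmark-relative return minus a cost penalty."""
    return GAMMA * (s.r_eq - s.r_bm) - BETA * s.cost

def keep_high_reward(samples: list[TradeSample], threshold: float = 0.0) -> list[TradeSample]:
    # Retain only samples whose reward clears the threshold for post-training.
    return [s for s in samples if decision_reward(s) > threshold]
```

Filtering on reward rather than keeping every trajectory is what makes the synthesized corpus "high‑quality" for fine‑tuning.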
Experiments
4.1 Datasets – For training, DeepSeek‑R1 serves as the base LLM to run TradingGroup over two non‑overlapping windows (2020‑06‑16 to 2021‑08‑16 and 2021‑08‑17 to 2022‑10‑05), collecting 1,080 high‑quality trajectories for distillation. Testing uses the public FINSABER back‑testing data (2022‑10‑06 to 2023‑04‑10) covering five stocks: AMZN, NFLX, TSLA, MSFT, COIN.
4.2 Experimental setup – All baselines share GPT‑4o‑mini as the inference core: rule‑based (buy‑and‑hold, SMA crossover), machine‑learning (ARIMA, XGBoost), reinforcement‑learning (FinRL A2C/PPO), and LLM‑based strategies (FinMem, FinAgent). A PEFT experiment fine‑tunes Qwen3‑8B with LoRA + int8 quantization on the synthesized data, yielding Qwen3‑Trader‑8B‑PEFT.
4.3 Evaluation metrics – Cumulative Return (CR), Sharpe Ratio (SPR), Maximum Drawdown (MDD), Annualized Volatility (AV).
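The four evaluation metrics can be computed from a single equity curve as sketched below. This is a standard formulation, assuming daily data (252 trading periods per year) and a zero risk‑free rate for the Sharpe ratio; the paper does not state its exact conventions.

```python
import numpy as np

def backtest_metrics(equity: np.ndarray, periods_per_year: int = 252) -> dict:
    """CR, Sharpe ratio, max drawdown, and annualized volatility from an
    equity curve (risk-free rate assumed zero for simplicity)."""
    ret = np.diff(equity) / equity[:-1]            # simple period returns
    cr = equity[-1] / equity[0] - 1.0              # cumulative return
    av = ret.std(ddof=1) * np.sqrt(periods_per_year)
    sharpe = (ret.mean() * periods_per_year) / av if av > 0 else 0.0
    peak = np.maximum.accumulate(equity)           # running high-water mark
    mdd = ((equity - peak) / peak).min()           # most negative drawdown
    return {"CR": cr, "Sharpe": sharpe, "MDD": mdd, "AV": av}
```

Reporting MDD as a negative number matches the paper's convention (e.g., AMZN MDD = ‑2.118 %).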
4.4 Results
Framework comparison – TradingGroup (GPT‑4o‑mini) outperforms all baselines on TSLA, AMZN, MSFT, and COIN. Example: AMZN CR = 40.46 % vs. second‑best 13.27 %; MDD = ‑2.118 %; AV = 17.228 %.
Risk control – Enabling the risk‑management module yields a more stable NFLX CR = 20.46 % (vs. 53.24 % without risk control but with higher volatility).
Data‑synthesis + PEFT – Qwen3‑Trader‑8B‑PEFT improves CR on all five stocks (e.g., TSLA CR = 28.67 % vs. 0.073 % for the base model) and surpasses GPT‑4o‑mini on several tickers.
Risk optimization – For MSFT, MDD drops from ‑22.17 % to ‑8.53 % and AV reduces from 41.21 % to 22.96 % after applying the risk module and data synthesis.
Ablation study – Removing self‑reflection (SR) or retrieval enhancement (RE) degrades CR (e.g., TSLA CR falls from 25.662 % for the full system to 5.276 %). Disabling risk management (RM) causes CR to plunge (TSLA CR = 25.662 % → ‑14.38 %).
Overall, the integration of self‑reflection, dynamic risk control, and high‑quality synthetic data enables TradingGroup to achieve markedly better profitability and risk profiles than rule‑based, ML, RL, and existing LLM baselines.