AlphaQuanter: An End‑to‑End Tool‑Orchestrating Agent Using Reinforcement Learning for Stock Trading
AlphaQuanter tackles the three major limitations of existing LLM trading agents by introducing a single‑agent framework that dynamically orchestrates market tools, learns transparent decision policies via reinforcement learning, and achieves state‑of‑the‑art performance on key financial metrics across extensive stock‑level experiments.
Background
Automated trading systems are evolving rapidly, yet current approaches suffer from critical drawbacks:
Traditional machine-learning models (e.g., SVM, random forest) : reduce the problem to discrete price-direction prediction and cannot integrate heterogeneous signals.
Deep reinforcement learning (DRL) : optimizes long-term returns but operates as a black box.
LLM agents : show promise but face three challenges (lack of tool orchestration and proactive information acquisition, insufficient decision transparency, and fragile prompt engineering), leading to inefficiency and signal inconsistency.
Problem Definition
The paper aims to resolve these challenges by designing a transparent, auditable, and robust single‑agent framework with three concrete objectives: (1) enable dynamic tool orchestration and proactive information gathering to fill information gaps; (2) optimize the end‑to‑end decision pipeline for greater transparency and explainability; (3) improve policy robustness to reduce dependence on prompt engineering and mitigate signal inconsistency.
Method
AlphaQuanter models the trading task as a tool-enhanced Markov Decision Process (MDP) defined by the tuple ⟨S, A, T, R⟩ (a minimal structural sketch follows the list):
State space S : composed of the initial context (stock symbol, date), the tool-call history (query_history), and tool results (query_result), i.e., s = {initial_context, query_history, query_result}.
Action space A : includes query actions A_q (invoking market data, fundamentals, sentiment, macro‑indicator tools) and decision actions A_d (BUY/SELL/HOLD).
Transition T : query actions update the state by appending call records and results; decision actions terminate the episode.
Reward R : the agent maximizes the cumulative trajectory reward, composed of a result reward R_{result}, a format reward R_{format}, and a tool-usage reward R_{tool}.
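To make the formulation concrete, here is a minimal structural sketch of the MDP components in Python; the class and field names are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any

class Decision(Enum):
    """Decision actions A_d; emitting one terminates the episode."""
    BUY = "BUY"
    SELL = "SELL"
    HOLD = "HOLD"

@dataclass
class QueryAction:
    """A query action in A_q: invoke one of the market tools."""
    tool_name: str            # e.g., "market_data", "sentiment" (names assumed)
    params: dict[str, Any]

@dataclass
class State:
    """s = {initial_context, query_history, query_result}."""
    initial_context: dict[str, str]                           # stock symbol, date
    query_history: list[QueryAction] = field(default_factory=list)
    query_result: list[Any] = field(default_factory=list)

def transition(state: State, action: QueryAction, result: Any) -> State:
    """T for query actions: append the call record and its result."""
    state.query_history.append(action)
    state.query_result.append(result)
    return state
```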
The cognitive workflow, inspired by the ReAct paradigm, follows a plan-retrieve-reason-decide loop (sketched in code after the list):
Plan : generate an initial trading hypothesis.
Retrieve : fill information gaps via tool calls.
Reason : update beliefs based on retrieved evidence.
Decide : output a final action (BUY/SELL/HOLD) or continue the loop.
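A hedged sketch of the loop, reusing the State, Decision, and transition definitions above; llm.plan, llm.act, the tools dictionary, and the step budget max_steps are hypothetical stand-ins for the paper's actual components.

```python
def trading_episode(llm, tools, state: State, max_steps: int = 8) -> Decision:
    """Plan-retrieve-reason-decide loop in the ReAct style (illustrative)."""
    hypothesis = llm.plan(state)                   # Plan: initial trading hypothesis
    for _ in range(max_steps):
        action = llm.act(state, hypothesis)        # Reason: choose the next action
        if isinstance(action, Decision):           # Decide: BUY/SELL/HOLD ends the loop
            return action
        result = tools[action.tool_name](**action.params)  # Retrieve: fill the info gap
        state = transition(state, action, result)  # fold the new evidence into the state
    return Decision.HOLD                           # conservative fallback when budget is spent
```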
Reward design details (composed into a single sketch after this list):
Result component : an exponentially weighted forward return r_t smooths short‑term volatility and emphasizes medium‑term trends; market state (bull, bear, sideways) is derived from r_t and mapped to discrete rewards (e.g., BUY +1.0 in strong bull).
Format component R_{format} : constrains reasoning trace length to a token interval [min_token, max_token], penalizing traces that fall outside it.
Tool component R_{tool} : limits the number of tool calls to [min_tool, max_tool] and penalizes invalid calls (e.g., parameter errors).
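A minimal sketch of how the three components could compose. The decay factor, horizon, state thresholds, reward table, and penalty weights below are all assumptions, not the paper's values; even the direction of the exponential weighting is a guess consistent with the description.

```python
def forward_return(prices: list[float], t: int, horizon: int = 10,
                   decay: float = 0.8) -> float:
    """Exponentially weighted forward return r_t: averaging over the window
    smooths short-term volatility (decay and horizon are assumed)."""
    weights = [decay ** k for k in range(1, horizon + 1)]
    rets = [(prices[t + k] - prices[t]) / prices[t] for k in range(1, horizon + 1)]
    return sum(w * r for w, r in zip(weights, rets)) / sum(weights)

def result_reward(action: str, r_t: float, bull: float = 0.02,
                  bear: float = -0.02) -> float:
    """Map the market state derived from r_t to a discrete reward
    (thresholds and reward table are assumed values)."""
    if r_t >= bull:                                   # bull market
        return {"BUY": 1.0, "HOLD": 0.0, "SELL": -1.0}[action]
    if r_t <= bear:                                   # bear market
        return {"BUY": -1.0, "HOLD": 0.0, "SELL": 1.0}[action]
    return {"BUY": -0.2, "HOLD": 0.5, "SELL": -0.2}[action]  # sideways

def total_reward(action: str, r_t: float, n_tokens: int, n_calls: int,
                 n_invalid: int, min_token: int = 256, max_token: int = 2048,
                 min_tool: int = 2, max_tool: int = 6) -> float:
    """R = R_result + R_format + R_tool, with assumed penalty weights."""
    r_format = 0.0 if min_token <= n_tokens <= max_token else -0.5
    r_tool = (0.0 if min_tool <= n_calls <= max_tool else -0.5) - 0.1 * n_invalid
    return result_reward(action, r_t) + r_format + r_tool
```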
Four tool categories are supported (a registry sketch follows the list):
Market data (price, volume, technical indicators).
Fundamental data (financial statements, dividends).
Sentiment data (news, social‑media sentiment).
Macro indicators (CPI, federal funds rate, commodity prices).
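A sketch of how these categories might be exposed to the agent as a tool registry, so the loop above can dispatch query actions by name; the function names and signatures are illustrative, and the real data providers and schemas are not specified in the paper summary.

```python
def get_market_data(symbol: str, date: str) -> dict:
    """Price, volume, and technical indicators as of `date`."""
    ...

def get_fundamentals(symbol: str, date: str) -> dict:
    """Financial statements and dividend history."""
    ...

def get_sentiment(symbol: str, date: str) -> dict:
    """News and social-media sentiment."""
    ...

def get_macro(date: str) -> dict:
    """CPI, federal funds rate, commodity prices."""
    ...

# Hypothetical name-to-callable registry used by the agent loop.
TOOLS = {
    "market_data": get_market_data,
    "fundamentals": get_fundamentals,
    "sentiment": get_sentiment,
    "macro": get_macro,
}
```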
Experiments
Setup : Five high‑volatility, event‑driven large‑cap stocks (GOOGL, MSFT, META, NVDA, TSLA) are used. Training data span 2022‑09‑01 to 2024‑03‑30 (395 days), validation 2024‑05‑15 to 2024‑11‑14 (128 days), and testing 2025‑01‑01 to 2025‑06‑30 (122 days) to avoid leakage.
Baselines : passive buy‑and‑hold, rule‑based strategies (MACD, ZMR), a multi‑agent LLM framework (TradingAgent), and a zero‑shot AlphaQuanter variant.
Metrics : annualized return (ARR), Sharpe ratio (SR), and maximum drawdown (MDD); a generic computation is sketched below.
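For reference, a standard way to compute these three metrics from a daily equity curve (a generic implementation, not code from the paper; risk-free rate taken as zero):

```python
import numpy as np

def evaluate(equity: np.ndarray, periods_per_year: int = 252):
    """Return (ARR, SR, MDD) for a daily equity curve."""
    rets = equity[1:] / equity[:-1] - 1.0
    years = len(rets) / periods_per_year
    arr = (equity[-1] / equity[0]) ** (1.0 / years) - 1.0      # annualized return
    sr = np.sqrt(periods_per_year) * rets.mean() / rets.std()  # annualized Sharpe ratio
    peak = np.maximum.accumulate(equity)                       # running maximum
    mdd = ((peak - equity) / peak).max()                       # maximum drawdown
    return float(arr), float(sr), float(mdd)
```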
Results :
The single‑agent architecture outperforms multi‑agent setups on ARR and SR for all models except GPT‑4o, confirming its effectiveness for smaller models.
Zero‑shot prompt baselines (excluding GPT‑4o) fail to beat the market, indicating the necessity of RL‑driven decision boundary learning.
AlphaQuanter‑3B and AlphaQuanter‑7B achieve ARR improvements of 6.54% and 18.45% over the strongest baseline; the 7B model leads in three of the five stocks.
Training dynamics : the 7B model enters a policy‑refinement phase after ~200 steps, increasing tool usage and reasoning length, whereas the 3B model converges early with reduced tool calls.
Performance validation : the 7B model shows steadily rising ARR and SR while MDD declines, suggesting enhanced risk control; the 3B model exhibits volatile MDD, reflecting weaker risk awareness.
Tool usage patterns : the 7B model concentrates on a few high‑impact tools (selective invocation), while the 3B model spreads calls across many tools, indicating difficulty distinguishing valuable information.
Heuristic insights : the 7B model treats trend, momentum, and volume indicators as primary signals; sentiment and macro data serve as secondary cues; low-frequency fundamental data receive minimal weight.
Ablation study :
Removing the format reward R_{format} drops ARR by 53.2%.
Removing the tool reward R_{tool} drops ARR by 43.0%.
Increasing the decision threshold θ reduces trading frequency, lowering MDD but sacrificing return; decreasing θ boosts return at the cost of higher risk (one plausible gating rule is sketched below).
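One plausible reading of the threshold mechanism, assuming θ gates a scalar signal score; this rule is an illustrative guess consistent with the ablation's behavior, not the paper's definition:

```python
def gated_decision(score: float, theta: float = 0.3) -> str:
    """Trade only when the signal score clears θ in magnitude.
    Higher θ: fewer trades, lower MDD, lower return; lower θ: the reverse.
    The score scale and default θ are illustrative assumptions."""
    if score > theta:
        return "BUY"
    if score < -theta:
        return "SELL"
    return "HOLD"
```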
Conclusion
AlphaQuanter demonstrates that a single, tool‑enhanced LLM agent trained with reinforcement learning can achieve state‑of‑the‑art performance on realistic stock‑trading benchmarks while providing transparent, auditable reasoning, thereby addressing the key limitations of existing multi‑agent and black‑box approaches.