How Do LLM Trading Agents Perform in a Competitive Market Arena?
The paper introduces Agent Market Arena (AMA), a lifelong, real‑time benchmark that evaluates diverse LLM‑based trading agents across crypto and equity markets, revealing that agent architecture, rather than the underlying LLM, drives performance differences and risk‑adjusted returns.
Background – Existing evaluations of LLM‑based trading agents focus on the underlying model rather than the agent design, and rely on short test windows, single assets, and unreliable data sources, leaving real‑time market reasoning and adaptability unassessed.
Problem definition – The authors aim to design a reproducible, multi‑asset, real‑time evaluation framework, quantify the impact of agent architecture versus LLM backbone, and compare risk‑style agents under dynamic market conditions.
Method – AMA consists of three core components:
Market Intelligence Stream (MIS) : aggregates real‑time data from OpenAI Web Search, Finnhub, NewsData, yfinance, CryptoNews, Binance, etc., and uses GPT‑5‑nano to de‑duplicate and summarize news, with expert validation achieving 87.5% date accuracy, 92.5% coverage, and 100% bias‑free scores.
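The paper delegates de‑duplication and summarization to GPT‑5‑nano; as a rough illustration of the de‑duplication step only, here is a simple normalized‑headline filter (a heuristic stand‑in, not the AMA implementation, and all names are hypothetical):

```python
def dedupe_news(items):
    """Keep the first occurrence of each normalized headline.

    Normalization: lowercase and collapse whitespace, so trivially
    restyled duplicates from different feeds map to the same key.
    """
    seen, unique = set(), []
    for item in items:
        key = " ".join(item["title"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

An LLM-based de-duplicator can additionally merge paraphrased stories, which a string key cannot; the expert-validation scores above suggest that is the harder part of the pipeline.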
Agent Execution Protocol (AEP) : standardizes daily inputs (asset IDs, historical prices, news summaries, trade metadata) and outputs discrete actions (BUY/SELL/HOLD) with fixed generation parameters (temperature 0.5, retries 3). Agent designs include InvestorAgent (baseline with memory), TradeAgent (multi‑expert), HedgeFundAgent (role‑based), and DeepFundAgent (memory‑adaptive).
Performance Analytics Interface (PAI) : visualizes daily, cumulative, annualized returns, volatility, Sharpe ratio, and max drawdown, supporting multi‑dimensional filtering by agent, asset, model, and strategy.
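The metrics PAI reports are standard; a minimal sketch of how they can be computed from a daily return series follows (textbook formulas, zero risk-free rate assumed; the function name and output keys are illustrative, not from the AMA codebase):

```python
import math

TRADING_DAYS = 252  # equity convention; crypto venues often annualize over 365

def performance_metrics(daily_returns, periods_per_year=TRADING_DAYS):
    """Cumulative/annualized return, annualized volatility, Sharpe, max drawdown."""
    n = len(daily_returns)
    # Cumulative return: compound the daily returns.
    cumulative = 1.0
    for r in daily_returns:
        cumulative *= 1.0 + r
    cumulative -= 1.0
    # Annualized volatility: sample std of daily returns, scaled by sqrt(periods).
    mean = sum(daily_returns) / n
    var = sum((r - mean) ** 2 for r in daily_returns) / (n - 1) if n > 1 else 0.0
    ann_vol = math.sqrt(var) * math.sqrt(periods_per_year)
    # Annualized return and Sharpe (risk-free rate assumed zero).
    ann_ret = (1.0 + cumulative) ** (periods_per_year / n) - 1.0
    sharpe = ann_ret / ann_vol if ann_vol > 0 else float("nan")
    # Max drawdown: largest peak-to-trough drop of the equity curve.
    equity, peak, mdd = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        mdd = max(mdd, (peak - equity) / peak)
    return {"cum_return": cumulative, "ann_return": ann_ret,
            "ann_vol": ann_vol, "sharpe": sharpe, "max_drawdown": mdd}
```

Annualizing a two-month live window, as AMA's evaluation period requires, inflates both returns and volatility, which is why the paper reports Sharpe ratios and drawdowns alongside raw returns.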
Experiment setup – Five LLM backbones (GPT‑4o, GPT‑4.1, Claude‑3.5‑haiku, Claude‑sonnet‑4, Gemini‑2.0‑flash) were tested on crypto (BTC, ETH) and stocks (TSLA, BMRN) from 2025‑05‑01 to 2025‑07‑31 (warm‑up) and evaluated live from 2025‑08‑01 to 2025‑09‑30.
Results
RQ1 – Profitability : Most agents generated positive returns; InvestorAgent (GPT‑4.1) on TSLA achieved 40.83% cumulative return with a Sharpe ratio of 6.47, while DeepFundAgent on BTC reached a Sharpe of 2.45, showing memory‑based adaptation improves volatility handling.
RQ2 – Architecture vs. model impact : Switching agents caused performance swings of up to 78% (e.g., GPT‑4.1 across agents from –38.72% to 40.83% cumulative return), whereas changing the LLM altered Sharpe ratios by less than 1.5, indicating architecture dominates.
RQ3 – Decision making : Case study on BTC demonstrated TradeAgent correctly hedged during a global market rally on 2025‑08‑13 and sold on 2025‑08‑28 despite long‑term positive news, highlighting dynamic signal integration.
RQ4 – Risk‑style differences : HedgeFundAgent (aggressive) yielded 39.66% return on ETH but with 638.04% annualized volatility; DeepFundAgent (conservative) maintained steady Sharpe ratios (1.96 on BMRN, 2.45 on ETH) by frequent HOLD actions.
Conclusion – AMA provides a rigorous, reproducible platform for continuous assessment of LLM trading agents, showing that agent design choices outweigh LLM selection in determining market performance.
