How TS‑Agent Uses LLMs and Reflective Feedback to Automate Financial Time‑Series Modeling
TS‑Agent is a modular LLM‑driven framework that formalizes financial time‑series modeling as a three‑stage iterative decision process, leveraging structured knowledge bases, dynamic memory, and a feedback‑driven code‑editing loop to outperform AutoML baselines in accuracy, robustness, and auditability.
Background – Financial markets generate massive time‑series data, yet building high‑performance, interpretable, and auditable models remains difficult. Conventional AutoML tools (e.g., AutoGluon, H2O AutoML) simplify model development but rely on static rule‑based model selection and generic statistical objectives, ignoring finance‑specific metrics such as risk‑adjusted returns.
Problem definition – The paper seeks a robust, adaptive, and auditable workflow for financial time‑series modeling that satisfies three requirements: (1) domain adaptability to handle high‑frequency sparsity and heavy‑tailed returns; (2) dynamic adaptation through experimental feedback; (3) transparent auditability of model‑selection, code‑modification, and hyper‑parameter decisions.
Method overview – TS‑Agent is a modular intelligent framework that structures the workflow into three iterative stages—model pre‑selection, code optimization, and fine‑tuning—guided by contextual reasoning and experimental feedback. The core architecture includes three read‑only knowledge bases: a Case Bank (historical financial modeling tasks), a Refinement Knowledge Bank (expert heuristics for preprocessing, training, and evaluation), and a Code Base (model implementations and metric definitions). Dynamic memory M_t records experiment logs and code snapshots, while context C_t aggregates memory, knowledge bases, and task description to inform decisions.
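The memory-and-context bookkeeping described above can be sketched as plain data structures. This is a minimal illustration, not the paper's implementation; all class and field names here are assumptions chosen for readability.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentRecord:
    """One entry of dynamic memory M_t: a run's model, code snapshot, and metrics."""
    model_name: str
    code_snapshot: str
    metrics: dict

@dataclass
class Context:
    """Context C_t aggregating memory, read-only knowledge bases, and the task."""
    task_description: str
    memory: list = field(default_factory=list)            # M_t: experiment logs
    case_bank: list = field(default_factory=list)         # historical modeling tasks
    refinement_bank: list = field(default_factory=list)   # expert heuristics
    code_base: dict = field(default_factory=dict)         # model implementations

    def log(self, record: ExperimentRecord) -> None:
        """A_logging: append an experiment outcome to dynamic memory."""
        self.memory.append(record)

    def best_run(self, metric: str) -> ExperimentRecord:
        """Return the logged run with the lowest value of `metric`."""
        return min(self.memory, key=lambda r: r.metrics[metric])
```

The key design point is that the three knowledge bases stay read-only while only `memory` grows across iterations, which is what makes the decision trail reconstructable after the fact.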
Feedback loop and code‑editing chain – Decision making is expressed as a probabilistic decomposition over four actions: A_{model} (model selection), A_{fine‑tune} (hyper‑parameter tuning), A_{refinement} (code optimization), and A_{logging} (recording experiment outcomes for traceability).
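One plausible way to write this decomposition is as a chain-rule factorization over the four actions conditioned on the context C_t. The summary does not reproduce the paper's exact equation, so the conditioning order below is an illustrative assumption:

```latex
p(A_t \mid C_t) \;=\;
  p(A_{\text{model}} \mid C_t)\,
  p(A_{\text{fine-tune}} \mid A_{\text{model}},\, C_t)\,
  p(A_{\text{refinement}} \mid A_{\text{model}},\, A_{\text{fine-tune}},\, C_t)\,
  p(A_{\text{logging}} \mid A_{\text{model}},\, A_{\text{fine-tune}},\, A_{\text{refinement}},\, C_t)
```

Under this reading, each stage's choice narrows the conditioning set for the next, and the logging action closes the loop by updating the memory inside C_{t+1}.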
Two‑stage workflow – Stage 1: Model pre‑selection uses case‑based reasoning on the Case Bank to shortlist candidates (e.g., Autoformer, PatchTST). Stage 2: Code optimization employs a round‑robin search: a warm‑up phase runs a few optimization‑tuning loops for each candidate and selects the best performing combination; an optimization phase then iterates further to refine performance.
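The round-robin search can be sketched as follows. Loop counts, function names, and the scalar-loss interface are illustrative assumptions, not the paper's exact settings:

```python
def round_robin_search(candidates, evaluate, warmup_rounds=3, optimize_rounds=10):
    """Warm-up: run a few optimization-tuning loops per candidate; then
    keep iterating only on the best performer.

    `evaluate(model)` is assumed to run one tuning loop and return a loss
    (lower is better).
    """
    scores = {}
    for model in candidates:                       # warm-up phase
        scores[model] = min(evaluate(model) for _ in range(warmup_rounds))

    winner = min(scores, key=scores.get)           # best warm-up combination
    best_score = scores[winner]

    for _ in range(optimize_rounds):               # optimization phase
        best_score = min(best_score, evaluate(winner))
    return winner, best_score
```

Spending only a few rounds per candidate before committing keeps the expensive optimization budget concentrated on the most promising model.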
Experiments – The authors evaluate TS‑Agent on prediction and synthetic‑generation tasks using three datasets: Crypto (hourly prices of 20 cryptocurrencies, 2024), Exchange (daily rates of 8 major currency pairs, 1990‑2010), and Stock (daily closing prices of 10 US stocks, 2020‑2024). Baselines include AutoGluon (prediction), Optuna (generation), DS‑Agent, and ResearchAgent. LLM backbones are GPT‑3.5, GPT‑4o (OpenAI), Claude Sonnet 4 (Anthropic), and Nova Pro (Amazon). Metrics cover standard statistical errors (RMSE, MAE, MAPE, sMAPE), finance‑specific measures (Sharpe ratio, VaR, ES), and success‑rate across multiple runs.
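For reference, the statistical and financial metrics named above have standard definitions; a minimal sketch (using historical VaR/ES, one common convention among several) is:

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))

def mape(y, yhat):
    # Mean absolute percentage error; assumes y has no zeros.
    return float(np.mean(np.abs((y - yhat) / y)) * 100)

def smape(y, yhat):
    # Symmetric MAPE: bounded and symmetric in y and yhat.
    return float(np.mean(2 * np.abs(yhat - y) / (np.abs(y) + np.abs(yhat))) * 100)

def sharpe_ratio(returns, rf=0.0):
    # Mean excess return per unit of volatility (sample std, ddof=1).
    excess = returns - rf
    return float(np.mean(excess) / np.std(excess, ddof=1))

def var_es(returns, alpha=0.05):
    """Historical Value-at-Risk (alpha-quantile of returns) and Expected
    Shortfall (mean return in the tail at or below that quantile)."""
    q = np.quantile(returns, alpha)
    es = returns[returns <= q].mean()
    return float(q), float(es)
```

The paper's "Sharpe difference" and "VaR/ES difference" compare these quantities between predicted (or synthetic) and realized return series; smaller differences mean the model preserves the risk profile of the real data.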
Prediction results – TS‑Agent consistently outperforms baselines. On the Crypto set, TS‑Agent (GPT‑4o) achieves RMSE 0.206, a 30 % reduction versus DS‑Agent (0.297) and 7.6 % versus AutoGluon (0.223). Financial robustness improves as well: Sharpe difference +0.3778, VaR difference 0.00125, ES difference 0.0302, with success rates of 80‑100 % (Crypto 100 %).
Generation results – For synthetic data, TS‑Agent yields higher statistical fidelity (marginal‑distribution, correlation, autocorrelation scores) and lower VaR/ES differences. Its metric variance across LLMs (e.g., Stock correlation score range 1.194‑3.468) is far smaller than DS‑Agent (5.282‑11.991) and ResearchAgent (7.175‑12.879), attributed to the structured model library and modular code templates that reduce randomness.
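An autocorrelation-fidelity score of the kind referenced above can be computed by comparing the ACFs of the real and synthetic series lag by lag. The distance definition below is a common choice, not necessarily the paper's exact scoring function:

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of series x at a given positive lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

def acf_distance(real, synthetic, max_lag=10):
    """Sum of absolute ACF differences over the first `max_lag` lags;
    lower means the synthetic series better preserves temporal structure."""
    return sum(abs(autocorr(real, k) - autocorr(synthetic, k))
               for k in range(1, max_lag + 1))
```

A perfectly faithful generator would score zero; the tighter score ranges reported for TS-Agent correspond to lower variance in this kind of fidelity measure across LLM backbones.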
Case study – Predicting the next three days of closing prices for ten US stocks (input = 60 days) proceeds as follows: (1) Model pre‑selection picks Autoformer and PatchTST from the Case Bank; (2) Warm‑up logs MAPE 3.41 for Autoformer and 4.17 for PatchTST; (3) Optimization iterates on Autoformer, reducing final MAPE to 1.86; (4) Every code change (e.g., learning‑rate schedule adjustment) and its justification are logged, enabling full auditability.