How TSci Uses LLMs to Automate End‑to‑End Time‑Series Forecasting

The article reviews the TSci framework, an LLM‑driven multi‑agent system that automates data diagnosis, model selection, ensemble forecasting, and report generation for time‑series prediction, achieving up to 38 % lower MAE than LLM baselines and improving report quality across five evaluation dimensions.


Background

Time‑series forecasting is essential for decision‑making in energy, finance, climate, and public health, but practical use suffers from short series, noise, heterogeneous sampling frequencies, high preprocessing cost, and poor generalisation of existing statistical and deep‑learning models. TSci introduces the first LLM‑based agent framework to automate the entire forecasting pipeline and provide transparent, "white‑box" reports.

Problem Definition

The goal is to build a general, low‑human‑intervention framework that addresses four core questions: (1) how to automatically diagnose and preprocess heterogeneous time series; (2) how to dynamically select and optimise models based on data characteristics; (3) how to improve robustness through model ensemble; and (4) how to generate transparent reports that enhance interpretability.

Method

TSci is a multi‑agent system comprising four specialised agents that mimic a human scientist’s workflow.

Curator (Data Steward)

The Curator uses LLM‑driven diagnostics and external tools to assess data quality, perform preprocessing, and visualise the series. It computes statistics (mean, std, trend), identifies missing (M) and outlier (O) values, and formulates processing strategies (π). Visual outputs include overview plots, decomposition (trend/seasonality/residual) charts, and ACF/PACF graphs.
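
The sketch below illustrates this kind of diagnostic pass in Python, assuming the series arrives as a pandas Series `y`; the 3-sigma outlier rule, the 1%/99% clipping thresholds, and the `period` argument are illustrative choices, not values from the paper.

```python
# A minimal sketch of Curator-style diagnostics, assuming the series is a
# pandas Series `y`. The 3-sigma outlier rule and quantile clipping thresholds
# are illustrative assumptions, not taken from the paper.
import numpy as np
import pandas as pd

def diagnose(y: pd.Series) -> dict:
    """Summarise basic statistics, missing values (M), and outliers (O)."""
    z = (y - y.mean()) / y.std()
    return {
        "mean": float(y.mean()),
        "std": float(y.std()),
        "missing": int(y.isna().sum()),          # M: missing observations
        "outliers": int((z.abs() > 3.0).sum()),  # O: crude 3-sigma flag
    }

def preprocess(y: pd.Series) -> pd.Series:
    """One possible processing strategy pi: interpolate gaps, clip extremes."""
    y = y.interpolate(limit_direction="both")
    lo, hi = y.quantile(0.01), y.quantile(0.99)
    return y.clip(lo, hi)

def visualize(y: pd.Series, period: int) -> None:
    """Overview, trend/seasonality/residual decomposition, and ACF/PACF plots."""
    from statsmodels.tsa.seasonal import seasonal_decompose
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    seasonal_decompose(y, period=period).plot()
    plot_acf(y)
    plot_pacf(y)
```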

Planner (Model Planner)

Based on the Curator’s analysis summary, the Planner narrows the model configuration space and conducts validation‑driven search. It selects candidate models from a predefined library of 21 models (ARIMA, LSTM, XGBoost, Prophet, etc.) according to data features (e.g., weak trend + long period → Prophet). Hyper‑parameter optimisation samples configurations θ_i on a validation set and retains the configuration with the lowest validation MAPE. The top‑k models are kept for ensemble.
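
A rough Python sketch of that validation-driven search follows; `candidates`, `grids`, and `fit_predict` are hypothetical placeholders standing in for the 21-model library and its training/prediction machinery, while the selection logic (lowest validation MAPE, keep top-k) follows the description above.

```python
# A minimal sketch of the Planner's validation-driven search. `candidates`,
# `grids`, and `fit_predict` are hypothetical placeholders; only the selection
# rule (lowest validation MAPE, keep top-k for the ensemble) follows the paper.
import numpy as np

def mape(y_true, y_pred) -> float:
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / np.clip(np.abs(y_true), 1e-8, None))))

def plan(candidates, grids, train, val, fit_predict, k=3):
    """Score each (model, theta_i) on the validation split and keep the top-k models."""
    scored = []
    for name in candidates:
        best_score, best_theta = float("inf"), None
        for theta in grids[name]:                      # sampled configurations theta_i
            pred = fit_predict(name, theta, train, horizon=len(val))
            score = mape(val, pred)
            if score < best_score:
                best_score, best_theta = score, theta
        scored.append((best_score, name, best_theta))
    scored.sort(key=lambda item: item[0])              # lowest validation MAPE first
    return scored[:k]                                  # top-k models enter the ensemble
```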

Forecaster (Forecast Executor)

The Forecaster dynamically selects an ensemble strategy based on validation results while avoiding test‑data leakage. Strategies include using a single best model when its advantage is significant, performance‑weighted averaging (inverse loss weighting with shrinkage), and robust aggregation (median or trimmed mean) to balance accuracy and stability.
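
To make the strategy choice concrete, here is a hedged Python sketch of the three aggregation modes computed only from validation losses; the advantage margin, the shrinkage constant, and the final fallback rule are illustrative assumptions rather than the paper's exact policy.

```python
# A minimal sketch of the three ensemble strategies, driven only by validation
# losses so the test set never influences the weights. The `margin`, shrinkage
# constant `lam`, and the dispersion-based fallback rule are illustrative
# assumptions, not taken from the paper.
import numpy as np
from scipy import stats

def combine(preds: np.ndarray, val_losses: np.ndarray,
            margin: float = 0.2, lam: float = 0.1) -> np.ndarray:
    """preds: (n_models, horizon) forecasts; val_losses: (n_models,) validation MAE/MAPE."""
    order = np.argsort(val_losses)
    best, runner_up = val_losses[order[0]], val_losses[order[1]]

    # 1) Single best model when its validation advantage is significant
    if best < (1.0 - margin) * runner_up:
        return preds[order[0]]

    # 2) Performance-weighted average: inverse-loss weights with additive shrinkage
    w = 1.0 / (val_losses + lam)
    w /= w.sum()
    weighted = w @ preds

    # 3) Robust aggregation (trimmed mean) for unstable candidate sets
    robust = stats.trim_mean(preds, proportiontocut=0.2, axis=0)

    # Simple fallback policy (assumption): use the robust estimate when the
    # validation losses are highly dispersed, otherwise the weighted average.
    return robust if val_losses.std() > val_losses.mean() else weighted
```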

Reporter (Report Generator)

The Reporter consolidates intermediate analysis and predictions into a structured report containing: integrated forecast with confidence intervals, performance summary (MAE, MAPE) for single and ensemble models, natural‑language explanations of model choices and weight derivations, visualisations (time‑series, decomposition, etc.), and a full decision log that makes each step transparent.
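
A small sketch of how such a report object might be assembled is given below; the field names are illustrative, not the paper's schema.

```python
# A minimal sketch of assembling the structured report; field names are
# illustrative assumptions, not the paper's schema.
def build_report(forecast, intervals, metrics, rationale, figures, decision_log) -> dict:
    """Bundle forecasts, metrics, explanations, figures, and the decision log."""
    return {
        "forecast": forecast,                  # integrated point forecast
        "confidence_intervals": intervals,     # lower/upper bounds per step
        "performance": metrics,                # MAE / MAPE for single and ensemble models
        "model_rationale": rationale,          # natural-language explanation of choices and weights
        "figures": figures,                    # time-series, decomposition, ACF/PACF plots
        "decision_log": decision_log,          # step-by-step record for transparency
    }
```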

Experiments

Setup

Eight benchmark datasets (ETT, Weather, ECL, Exchange, ILI, etc.) covering energy, environment, economics, and health with frequencies from hourly to weekly and lengths from 966 to 69 680 are used. Baselines include statistical models (ARIMA, Prophet) and LLM baselines (GPT‑4o, Gemini‑2.5 Flash, Qwen‑Plus). Evaluation metrics are MAE and MAPE; report quality is assessed on five dimensions: analysis rigor (AS), model justification (MJ), explanation consistency (IC), actionable quality (AQ), and structural clarity (SC).

Forecasting Performance

TSci achieves an average MAE reduction of 10.4 % over statistical baselines and 38.2 % over LLM baselines. For example, on the ETTh1 dataset TSci records MAE = 2.02 versus the second‑best baseline MAE = 9.16, and on the Exchange dataset MAE drops to 4.50e‑2, more than 60 % lower than the baseline.

Report Quality

TSci‑generated reports win >80 % of the time on analysis rigor (AS) and model justification (MJ), and >75 % on explanation consistency (IC) and actionable quality (AQ), demonstrating strong technical rigour and communication effectiveness.

Ablation Study

Removing any module degrades performance: without Curator preprocessing, average MAE increases by 41.8 %; without the analysis module, MAE rises by 28.3 %; without hyper‑parameter optimisation, MAE grows by 36.2 %, especially on long‑period or high‑variance series.

Tags: LLM, time-series forecasting, agent framework, automated analytics, forecasting performance, TSci
Written by Bighead's Algorithm Notes, focused on AI applications in the fintech sector.
