How JD’s TimeHF Billion‑Scale Time‑Series Model Boosts Forecast Accuracy with RLHF
JD’s supply‑chain algorithm team introduces TimeHF, a billion‑scale time‑series large model that leverages RLHF to improve demand forecasting accuracy by over 10%, detailing dataset creation, the PCTLM architecture, a custom RL framework (TPO), and superior benchmark results.
Introduction
Time‑series forecasting is a core technology for supply‑chain management, energy scheduling, and financial risk control. Traditional methods such as ARIMA, Prophet, LSTM, and TCN struggle with complex patterns, long‑range dependencies, and zero‑shot generalization across diverse product categories.
Challenges in Existing Approaches
Insufficient pattern capture: Linear assumptions and local modeling miss long‑term and multivariate interactions.
Weak zero‑shot generalization: Supervised models need retraining for each new scenario.
Poor dataset quality: Public time‑series datasets are limited in size, diversity, and scalability, preventing large‑scale models from demonstrating scaling laws.
Lack of effective RLHF for time series: Existing RLHF frameworks for LLMs do not suit continuous‑value forecasting models.
Dataset Construction (2.1)
The team assembled a 1.5‑billion‑sample high‑quality dataset by mixing JD’s internal sales data, public datasets (Monash, TSLib), and synthetic data. The pipeline includes data cleaning, labeling (length, average sales, zero‑sale ratio), quality filtering, deduplication, diversity ranking, and controlled data‑type ratios (≈76% JD data, 20% synthetic, 4% public).
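The labeling and quality-filtering steps above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline; the field names and thresholds (`min_length`, `max_zero_ratio`) are assumptions.

```python
import numpy as np

def label_series(series: np.ndarray) -> dict:
    """Compute the per-series labels mentioned in the text:
    length, average sales, and zero-sale ratio."""
    return {
        "length": int(len(series)),
        "avg_sales": float(series.mean()),
        "zero_sale_ratio": float((series == 0).mean()),
    }

def passes_quality_filter(labels: dict,
                          min_length: int = 64,
                          max_zero_ratio: float = 0.9) -> bool:
    """Drop series that are too short or almost entirely zero sales.
    Thresholds here are illustrative, not from the paper."""
    return (labels["length"] >= min_length
            and labels["zero_sale_ratio"] <= max_zero_ratio)
```

In a real pipeline these labels would also drive deduplication and the diversity ranking used to control the ≈76/20/4 data-type mix.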
Model Design – PCTLM (2.2)
PCTLM (Patch Convolutional Time‑Series Large Model) partitions the input series into overlapping patches, projects each patch into a vector, and processes the patch sequence with a convolution‑based encoder that captures cross‑patch information. A grouped attention mechanism with rotary position encoding (RoPE) reduces computational cost while preserving temporal relationships.
Training Scheme – RLHF for Time Series (2.3)
Standard RLHF frameworks (PPO, RLOO) cannot be applied directly because time‑series models emit deterministic point forecasts rather than probability distributions. The team therefore created TPO (Timeseries Policy Optimization), an RLHF pipeline that:
Adds paired “good‑vs‑bad” predictions to the RLHF dataset.
Introduces a probability‑output component that models predictions as Gaussian N(μ,1), enabling KL‑divergence computation.
Uses a REINFORCE‑style advantage function based on baseline reward differences, avoiding TD‑error.
Combines a pre‑training loss with MSE to retain forecasting performance while preventing over‑fitting during fine‑tuning.
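The four ingredients above can be combined into a single training objective. The sketch below is an assumption-laden illustration, not the paper's TPO loss: the function name, the weights `beta` and `lam`, and the baseline handling are all hypothetical. It does show the pieces the text names: predictions modeled as Gaussian N(μ, 1), a closed-form KL term against a reference model, a REINFORCE-style advantage (reward minus baseline, no TD error), and an MSE term to retain forecasting performance.

```python
import torch

def tpo_loss(mu: torch.Tensor, mu_ref: torch.Tensor, target: torch.Tensor,
             reward: torch.Tensor, baseline: torch.Tensor,
             beta: float = 0.1, lam: float = 1.0) -> torch.Tensor:
    """Sketch of a TPO-style objective (names and weights are assumptions).
    mu, mu_ref, target: (batch, horizon); reward, baseline: (batch,)."""
    # Model the forecast as Gaussian N(mu, 1) so it has a log-probability.
    dist = torch.distributions.Normal(mu, 1.0)
    sample = dist.rsample()
    log_prob = dist.log_prob(sample).sum(-1)          # (batch,)

    # REINFORCE-style advantage: baseline reward difference, no TD error.
    advantage = (reward - baseline).detach()
    policy_loss = -(advantage * log_prob).mean()

    # KL(N(mu, 1) || N(mu_ref, 1)) has the closed form (mu - mu_ref)^2 / 2.
    kl = 0.5 * (mu - mu_ref).pow(2).sum(-1).mean()

    # MSE term to keep forecasting accuracy during fine-tuning.
    mse = torch.nn.functional.mse_loss(mu, target)

    return policy_loss + beta * kl + lam * mse
```

In practice the reward would come from human (or human-aligned) preference judgments over the paired good-vs-bad predictions added to the RLHF dataset.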
Experimental Results (3)
On public benchmarks, the PCTLM model fine‑tuned with SFT + TPO outperformed GPT‑4TS and five strong baselines (PatchTST, Autoformer, iTransformer, DLinear, Informer) in MAE, achieving state‑of‑the‑art performance across most datasets.
Conclusion
The authors present a complete pipeline—PCTLM + SFT + TPO—for training billion‑scale time‑series models. The PCTLM architecture is the first pure time‑series model exceeding one billion parameters, delivering zero‑shot performance superior to GPT‑4TS and traditional supervised models. The custom RLHF framework TPO further improves forecasting accuracy and has already been deployed in JD’s supply‑chain system, automating replenishment for 20,000 SKUs with a notable boost in prediction accuracy.
For more technical details, see the paper “TimeHF: Billion‑Scale Time Series Models Guided by Human Feedback” (https://arxiv.org/abs/2501.15942).
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.