How AI Powers Large‑Scale Time Series Forecasting and Root‑Cause Analysis

This article describes Suning's AI‑driven end‑to‑end solution for monitoring massive numbers of time series. It covers anomaly detection; forecasting with DeepAR, MQ‑RNN, MQ‑CNN, and ensemble methods; root‑cause localization using HotSpot and Monte‑Carlo Tree Search; and the evolution of the company's large‑scale log analytics platform.

Suning Technology

Background

In recent years Suning has built a cloud‑based platform that provides open financial, supply‑chain, and product promotion services for manufacturers, agents, retailers, and end users. As online business has expanded, the platform has grown to more than 5,000 systems and over 150,000 services, which makes monitoring and fault tolerance increasingly complex.

To ensure stable operation, Suning proposes a one‑stop solution covering system monitoring, problem localization, real‑time alerting, decision analysis, and automated fault recovery. Deep‑learning models strengthen the solution by constructing knowledge graphs from monitoring data to pinpoint root causes.

Anomaly Detection Platform

The platform consists of four major modules:

Anomaly detection using deep‑learning probability density estimation and Monte‑Carlo sampling, achieving up to 98.21% alert accuracy.

Metric prediction with a stacking ensemble of DeepAR, MQ‑RNN, and MQ‑CNN, forecasting the next 20 minutes from the latest 400 minutes of data.

Multidimensional analysis that applies HotSpot‑style algorithms to isolate the most influential dimensions of an abnormal KPI.

Custom dashboards that support diverse data sources, unified SQL expressions, and various visualizations (pie, line, bar, swim‑lane charts).
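The first module above (density estimation plus Monte‑Carlo sampling) can be illustrated with a minimal sketch. It assumes the deep model has already produced a predictive density for one time step. A Gaussian with hypothetical parameters stands in for that density here, and the names `mc_interval` and `is_anomaly` are illustrative, not taken from Suning's platform.

```python
import random

def mc_interval(sample_fn, n_samples=2000, lower_q=0.005, upper_q=0.995, seed=0):
    """Estimate a [lower, upper] band by Monte Carlo sampling from a predictive density."""
    rng = random.Random(seed)
    draws = sorted(sample_fn(rng) for _ in range(n_samples))
    lower = draws[int(lower_q * (n_samples - 1))]
    upper = draws[int(upper_q * (n_samples - 1))]
    return lower, upper

def is_anomaly(observed, band):
    """Flag the observation when it falls outside the sampled band."""
    lower, upper = band
    return observed < lower or observed > upper

# Hypothetical predictive density for one time step: Gaussian, mean 100, std 5.
band = mc_interval(lambda rng: rng.gauss(100.0, 5.0))
print(band, is_anomaly(150.0, band), is_anomaly(100.0, band))
```

The quantile levels control the trade‑off between missed anomalies and false alerts. The values used here are placeholders.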

Time‑Series Prediction Methods

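DeepAR is a supervised forecasting algorithm based on autoregressive recurrent neural networks. It trains a single model across many related time series and outputs a probability distribution for each future step, from which both point and probabilistic forecasts can be derived. When many related series are available, it typically outperforms traditional ARIMA and exponential smoothing techniques.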
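To make the probabilistic output concrete, here is a minimal sketch of how an autoregressive model of this kind can turn sampled trajectories into quantile forecasts. The function `toy_step` is a hypothetical stand‑in for the trained network, with made‑up coefficients. It is not DeepAR itself, only the sampling pattern.

```python
import random

def toy_step(prev_value):
    """Hypothetical stand-in for the trained RNN: previous value -> (mean, std)."""
    return 0.9 * prev_value + 1.0, 0.5

def sample_paths(last_value, horizon, n_paths=500, seed=0):
    """Draw sample trajectories; each sampled value is fed back as the next input."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        value, path = last_value, []
        for _ in range(horizon):
            mean, std = toy_step(value)
            value = rng.gauss(mean, std)
            path.append(value)
        paths.append(path)
    return paths

def quantile_forecast(paths, q):
    """Empirical q-quantile of the sampled values at each forecast step."""
    result = []
    for t in range(len(paths[0])):
        column = sorted(p[t] for p in paths)
        result.append(column[int(q * (len(column) - 1))])
    return result

paths = sample_paths(last_value=10.0, horizon=20)
q10, q50, q90 = (quantile_forecast(paths, q) for q in (0.1, 0.5, 0.9))
```

The median path (`q50`) serves as a point forecast, while `q10` and `q90` give an uncertainty band around it.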

MQ‑RNN extends sequence‑to‑sequence models with a global MLP that combines encoder outputs and future features, and a local MLP that shares parameters across horizons to predict quantiles directly.
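Predicting quantiles directly is usually done by minimizing the quantile (pinball) loss over all horizons and quantile levels. The sketch below shows that loss in plain Python. The function names and the data layout (`forecasts[t][k]` for horizon `t` and quantile `k`) are assumptions made for illustration.

```python
def pinball_loss(y_true, y_pred, q):
    """Quantile loss: under-prediction costs q, over-prediction costs (1 - q)."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)

def mean_quantile_loss(targets, forecasts, quantiles):
    """Average pinball loss over all horizons and quantile levels.

    forecasts[t][k] is the forecast for horizon t at quantile level quantiles[k].
    """
    total, count = 0.0, 0
    for y, row in zip(targets, forecasts):
        for q, pred in zip(quantiles, row):
            total += pinball_loss(y, pred, q)
            count += 1
    return total / count

# For q = 0.9, forecasting too low is penalized more than forecasting too high.
low = pinball_loss(10.0, 8.0, 0.9)
high = pinball_loss(10.0, 12.0, 0.9)
```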

MQ‑CNN replaces the encoder with dilated CNN layers, adds residual connections, and uses a final 1×1 CNN to generate skip outputs, enabling faster training and handling longer input sequences.
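The reason dilated convolutions handle long inputs is that the receptive field grows with the sum of the dilation rates. The sketch below shows a single causal dilated convolution and the receptive‑field arithmetic. The kernel size and the dilation schedule are hypothetical choices, not the configuration used by Suning.

```python
def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], with implicit zero padding on the left."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # only past or present inputs are used
                acc += w * x[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input steps visible to one output of a stack of dilated layers."""
    return 1 + (kernel_size - 1) * sum(dilations)

y = causal_dilated_conv([1, 2, 3, 4], weights=[1, 1], dilation=2)
# Nine layers with doubling dilation and kernel size 2 see 512 steps,
# enough to cover a 400-minute input window at one-minute resolution.
field = receptive_field(2, [1, 2, 4, 8, 16, 32, 64, 128, 256])
```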

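The ensemble stacks the three base models. Their predictions are standardized and fed to several candidate meta‑learners: a feed‑forward neural network (FFNN), XGBoost, Random Forest, Ridge Regression, and a simple weighted average. The meta‑learner with the lowest SMAPE (symmetric mean absolute percentage error) on a validation set is selected.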
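The selection step can be sketched as follows. SMAPE is computed in its common form, 100 × mean(|f − a| / ((|a| + |f|) / 2)). The two candidate combiners are trivial placeholders standing in for the real meta‑learners, and all names and numbers are hypothetical.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        total += 0.0 if denom == 0 else abs(f - a) / denom
    return 100.0 * total / len(actual)

def select_meta_learner(candidates, base_preds, actual):
    """Score each candidate combiner on validation data and return the best one.

    candidates: {name: function mapping one row of base-model predictions to a forecast}
    base_preds: one row per time step, one column per base model
    """
    scores = {
        name: smape(actual, [combine(row) for row in base_preds])
        for name, combine in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores

# Placeholder combiners standing in for FFNN, XGBoost, and the other meta-learners.
candidates = {
    "simple_average": lambda row: sum(row) / len(row),
    "weighted_average": lambda row: 0.6 * row[0] + 0.2 * row[1] + 0.2 * row[2],
}
base_preds = [[100.0, 130.0, 130.0], [200.0, 260.0, 260.0]]  # hypothetical validation rows
actual = [100.0, 200.0]
best, scores = select_meta_learner(candidates, base_preds, actual)
```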

Root‑Cause Localization

For additive KPIs (e.g., login count, successful payments), the platform uses the HotSpot algorithm. HotSpot models how an anomaly propagates across multidimensional attribute combinations, assigns each candidate root‑cause set a potential score that expresses confidence, and uses Monte‑Carlo Tree Search to explore the massive search space efficiently.
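The potential score can be sketched under the usual "ripple effect" assumption: if an attribute value is the root cause, the change in its total is spread across its leaf combinations in proportion to their forecasts. The score then measures how well that deduced pattern explains the observed values. The code below is a simplified single‑element version for illustration, with hypothetical data. It omits the tree search entirely.

```python
import math

def potential_score(leaves, candidate):
    """Score one candidate root cause.

    leaves:    {attribute_tuple: (forecast, actual)} for every leaf combination
    candidate: (dimension_index, attribute_value)
    """
    dim, value = candidate
    covered = [key for key in leaves if key[dim] == value]
    f_sum = sum(leaves[key][0] for key in covered)
    v_sum = sum(leaves[key][1] for key in covered)
    ratio = v_sum / f_sum if f_sum else 1.0
    explained = total = 0.0
    for key, (f, v) in leaves.items():
        # Deduced value: covered leaves change proportionally, others stay at forecast.
        a = f * ratio if key[dim] == value else f
        explained += (v - a) ** 2
        total += (v - f) ** 2
    if total == 0:
        return 0.0
    return max(1.0 - math.sqrt(explained) / math.sqrt(total), 0.0)

# Hypothetical KPI with dimensions (province, ISP); only Beijing traffic dropped by half.
leaves = {
    ("Beijing", "Mobile"): (100.0, 50.0),
    ("Beijing", "Unicom"): (60.0, 30.0),
    ("Shanghai", "Mobile"): (100.0, 100.0),
    ("Shanghai", "Unicom"): (60.0, 60.0),
}
ps_beijing = potential_score(leaves, (0, "Beijing"))
ps_mobile = potential_score(leaves, (1, "Mobile"))
```

A score near 1 means the candidate explains almost all of the deviation. Here the province element scores far higher than the ISP element, so it would be preferred as the root cause.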

Operational Knowledge Graph

By clustering repetitive alerts using a hardware‑software knowledge graph, the system reduces noise and accelerates root‑cause identification. The workflow includes sample construction, alert data association, time‑slice segmentation, causal discovery (CGNN, PC, GES, LiNGAM), graph analysis, edge‑weight calculation, and weighted path search for probable root causes.
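The final step, weighted path search, can be sketched as a shortest‑path problem. If each edge weight is read as the strength of a causal link in (0, 1], the most probable chain from an alerting node to a candidate root cause maximizes the product of weights, which Dijkstra's algorithm finds on costs of −log(weight). The graph, node names, and weights below are hypothetical, and this interpretation of the edge weights is an assumption.

```python
import heapq
import math

def most_probable_paths(graph, start):
    """Return {node: (path_probability, path)} for every node reachable from start.

    graph: {node: {neighbor: weight in (0, 1]}}, edges pointing from symptom to cause.
    """
    best = {}
    heap = [(0.0, start, [start])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        best[node] = (math.exp(-cost), path)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in done:
                heapq.heappush(heap, (cost - math.log(weight), neighbor, path + [neighbor]))
    return best

# Hypothetical causal graph learned from alert data.
graph = {
    "api_timeout": {"db_slow": 0.8, "net_loss": 0.3},
    "db_slow": {"disk_full": 0.9},
    "net_loss": {"disk_full": 0.2},
}
best = most_probable_paths(graph, "api_timeout")
probability, path = best["disk_full"]
```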

Large‑Scale Log Analysis Platform

The platform supports over 2,500 systems, ingesting more than 50 TB of logs daily (≈1.3 trillion entries) with 4 million writes per second and sub‑second query latency. The architecture was optimized iteratively over five stages:

Initial VM‑based Elasticsearch cluster with daily indices.

Introduction of client nodes, removal of co‑located VMs, hourly indices, and field pruning.

Separation of clusters by log type, deployment of physical data nodes, and degradation mechanisms for peak traffic.

Hot/Cold node strategy to migrate historical data and isolate critical exception logs.

Replacement of tribe‑node metadata aggregation with direct cluster connections, and Flink‑based log ingestion for fault‑tolerant processing.
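Two of the ideas in these stages, time‑based indices and hot/cold placement, can be sketched with a pair of helper functions. The index naming pattern, the function names, and the 48‑hour hot window are all hypothetical. The sketch shows only the routing logic and does not call any Elasticsearch API.

```python
from datetime import datetime, timedelta, timezone

def index_name(log_type, ts, hourly=True):
    """Build a time-based index name, e.g. 'app-2020.01.02.13' for hourly indices."""
    fmt = "%Y.%m.%d.%H" if hourly else "%Y.%m.%d"
    return f"{log_type}-{ts.strftime(fmt)}"

def tier_for(index_ts, now, hot_hours=48):
    """Keep recent indices on hot nodes; older ones are candidates for cold nodes."""
    return "hot" if now - index_ts <= timedelta(hours=hot_hours) else "cold"

ts = datetime(2020, 1, 2, 13, 45, tzinfo=timezone.utc)
now = datetime(2020, 1, 5, 0, 0, tzinfo=timezone.utc)
print(index_name("app", ts), tier_for(ts, now))
```

Hourly indices keep each index smaller than a daily one, so old data can be migrated or dropped in finer steps.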

Vision

Future work aims to embed more AI components into the monitoring and operations ecosystem, achieving truly unattended, cost‑effective, and high‑efficiency AIOps.

Tags: deep learning, anomaly detection, time series forecasting, knowledge graph, root cause analysis, log analytics, ensemble learning