Boosting Mega‑Sale Stability: Suning’s Backend Data Components in Action
The article details how Suning’s transaction middle‑platform leverages custom TPS collection, advanced flow‑control, big‑data analytics, and AI‑driven forecasting to ensure system stability, capacity planning, and intelligent inventory distribution during the high‑traffic 818 promotional event.
In 2020, the 818 shopping festival coincided with Suning's 30th anniversary, and Suning launched its new "Focus on Good Service" brand positioning. The resulting traffic surge put significant pressure on the stability of Suning's middle platform, which had to provide fast, efficient, and intelligent support throughout the promotion.
Promotion Guarantee Pain Points
Large‑scale promotions test the limits of an e‑commerce backend. Sudden spikes in consumer traffic raise system pressure and force a shift from designs sized for ordinary daily business to high‑concurrency architectures. Stability, peak‑traffic handling, and rapid data aggregation become the new pain points.
Data Components Supporting Promotion Guarantee
TPS Collection Component
The RSF platform provided only minute‑level call statistics, with no second‑level precision, and its raw logs were sampled, so exact per‑second counts could not be reconstructed from them. Precise second‑level monitoring became essential for capacity planning, real‑time monitoring, and flow control.
Suning developed a TPS collection component that, to date, has been integrated into nearly 30 systems and over 200 core services. A unified portal displays real‑time and historical call volumes, greatly improving link analysis efficiency during stress tests and promotions. The collected data serves as fundamental material for performance analysis, capacity planning, and resource reservation, allowing business teams to visualize service pressure directly.
Solution:
Use Spring AOP to weave aspects; business systems add an annotation and enable the SCM switch. Each JVM records per‑second call counts.
Flume collects logs in real time.
Flink aggregates per‑second call volumes across single‑ and multi‑data‑center deployments.
Aggregated data is stored in a database.
The portal displays the metrics live.
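The steps above can be sketched in a few lines. This is a minimal illustration in Python rather than the Spring AOP implementation the article describes: a decorator stands in for the annotation plus aspect, an `enabled` flag stands in for the SCM switch, and `aggregate` plays the role of the Flink job. All names here (`TpsCollector`, `collect`, `drain`, `aggregate`) are hypothetical.

```python
import threading
import time
from collections import defaultdict
from functools import wraps


class TpsCollector:
    """Counts calls per (service, second) inside one process."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._lock = threading.Lock()
        self._counts = defaultdict(int)
        self.enabled = True  # assumption: stands in for the SCM switch

    def record(self, service):
        if not self.enabled:
            return
        second = int(self._clock())  # truncate to the current second
        with self._lock:
            self._counts[(service, second)] += 1

    def collect(self, service):
        """Decorator playing the role of the annotation plus aspect."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                self.record(service)
                return func(*args, **kwargs)
            return wrapper
        return decorator

    def drain(self):
        """Return and clear the counters, as a log shipper would."""
        with self._lock:
            snapshot = dict(self._counts)
            self._counts.clear()
        return snapshot


def aggregate(snapshots):
    """Sum per-process snapshots into cluster-wide per-second totals."""
    totals = defaultdict(int)
    for snapshot in snapshots:
        for key, count in snapshot.items():
            totals[key] += count
    return dict(totals)
```

Each process keeps only a small in-memory counter and ships it periodically, so the measurement adds little work to the request path.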
Flow‑Control Component
Flow control is a critical safeguard: it rejects excess traffic to protect the system. There are two technical approaches, concurrency‑based flow control (used by RSF) and QPS‑based flow control. Suning built a custom flow‑control component to address the limitations of the former.
Distributed Flow‑Control
Each JVM receives its own flow‑control threshold and regulates its traffic locally. The overall service capacity is calculated and divided among the instances; because every check happens inside the JVM, the overhead is minimal.
Solution:
The control platform distributes thresholds to each JVM.
Requests are aggregated per second to check against the threshold.
Excess requests are throttled and logged.
Flume gathers flow‑control logs.
Flink aggregates logs across data centers and stores them.
The portal visualizes flow‑control data.
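A minimal sketch of the per-instance check, written in Python for illustration. It assumes the total capacity is split evenly across instances and that requests are counted in fixed one-second windows; the article does not specify either detail, and the names `per_instance_threshold` and `LocalRateLimiter` are hypothetical.

```python
import threading
import time


def per_instance_threshold(total_qps, instance_count):
    """Assumption: capacity is divided evenly among instances."""
    return total_qps // instance_count


class LocalRateLimiter:
    """Fixed one-second window limiter local to a single process."""

    def __init__(self, threshold, clock=time.time):
        self.threshold = threshold
        self._clock = clock
        self._lock = threading.Lock()
        self._window = None
        self._count = 0
        self.rejected = 0  # a real component would log these instead

    def try_acquire(self):
        second = int(self._clock())
        with self._lock:
            if second != self._window:
                # A new second has started: reset the counter.
                self._window = second
                self._count = 0
            if self._count >= self.threshold:
                self.rejected += 1
                return False
            self._count += 1
            return True
```

No remote call is made on the request path, which is where the low overhead comes from. The trade-off is that uneven load across instances can cause rejections on one instance while others still have headroom.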
Distributed Hot/Cold Bucket Flow‑Control
This mechanism prevents traffic spikes on popular items (e.g., flash sales) from blocking normal traffic. Hot items are distinguished automatically or from a fixed configuration, and separate thresholds are applied to the hot and cold buckets.
Solution:
Control platform pushes thresholds to each JVM.
Requests are classified as hot or cold (auto or fixed).
Hot and cold requests are aggregated per second and compared against their respective thresholds.
Excess traffic is throttled and logged.
Flume collects the logs.
Flink aggregates and stores the counts.
The portal displays the results.
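The classification and per-bucket check can be sketched as follows. This Python example is illustrative only: the automatic rule used here (an item becomes hot once it has been requested more than `auto_hot_after` times within the current second) is an assumption, since the article does not say how hot items are detected. `HotColdLimiter` and its parameters are hypothetical names.

```python
import threading
import time
from collections import defaultdict


class HotColdLimiter:
    """Per-second limiter with separate hot and cold buckets."""

    def __init__(self, hot_limit, cold_limit, fixed_hot=(),
                 auto_hot_after=None, clock=time.time):
        self._limits = {"hot": hot_limit, "cold": cold_limit}
        self._fixed_hot = set(fixed_hot)       # "fixed" classification
        self._auto_hot_after = auto_hot_after  # "auto" classification (assumed rule)
        self._clock = clock
        self._lock = threading.Lock()
        self._window = None
        self._seen = defaultdict(int)
        self._used = {"hot": 0, "cold": 0}
        self.rejected = {"hot": 0, "cold": 0}

    def _bucket(self, item_id):
        if item_id in self._fixed_hot:
            return "hot"
        if (self._auto_hot_after is not None
                and self._seen[item_id] > self._auto_hot_after):
            return "hot"
        return "cold"

    def try_acquire(self, item_id):
        second = int(self._clock())
        with self._lock:
            if second != self._window:
                # New second: reset per-item counts and bucket usage.
                self._window = second
                self._seen.clear()
                self._used = {"hot": 0, "cold": 0}
            self._seen[item_id] += 1
            bucket = self._bucket(item_id)
            if self._used[bucket] >= self._limits[bucket]:
                self.rejected[bucket] += 1
                return False
            self._used[bucket] += 1
            return True
```

Because each bucket has its own counter, a flood of requests for a flash-sale item can exhaust only the hot quota; requests for ordinary items are still checked against the untouched cold quota.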
Global Flow‑Control
Designed for scenarios that require fine‑grained control (e.g., order submission). A global threshold is stored in Redis; each JVM fetches quota from it one step at a time and deducts from that local quota to enforce flow control.
Solution:
Control platform distributes global thresholds and step sizes.
Each request checks local quota; if insufficient, it pulls a step from the shared pool.
When the shared pool is exhausted, excess traffic is throttled and logged.
Flume gathers logs.
Flink aggregates across data centers and stores the data.
The portal visualizes the flow‑control metrics.
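The step-based quota scheme can be sketched as below. To keep the example self-contained, an in-memory `SharedQuotaPool` stands in for the threshold held in Redis; a real deployment would need an atomic operation on the Redis side, which is not shown. The sketch also covers a single time window only and omits the periodic reset of the pool. All class and method names are hypothetical.

```python
import threading


class SharedQuotaPool:
    """In-memory stand-in for the global threshold held in Redis."""

    def __init__(self, threshold):
        self._remaining = threshold
        self._lock = threading.Lock()

    def take(self, step):
        """Grant up to `step` units; returns 0 once the pool is exhausted."""
        with self._lock:
            granted = min(step, self._remaining)
            self._remaining -= granted
            return granted


class GlobalLimiterClient:
    """Per-process client that consumes quota locally, one step at a time."""

    def __init__(self, pool, step):
        self._pool = pool
        self._step = step
        self._local = 0
        self._lock = threading.Lock()
        self.rejected = 0  # a real component would log these instead

    def try_acquire(self):
        with self._lock:
            if self._local == 0:
                # Local quota is used up: pull the next step from the pool.
                self._local = self._pool.take(self._step)
            if self._local == 0:
                self.rejected += 1
                return False
            self._local -= 1
            return True
```

The step size is the tuning knob: a larger step means fewer round trips to the shared store, while a smaller step keeps the enforced total closer to the global threshold when traffic is spread unevenly.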
Flow‑Control Comparison
Distributed, global, and hot/cold flow‑control each have advantages and drawbacks. Different business scenarios dictate the appropriate solution. Future work aims to create a dynamic flow‑control system that can switch between distributed and global modes, dynamically adjust hot/cold thresholds, and adapt based on health‑data analysis.
Middle‑Platform Data Cloud Supporting Promotion Guarantee
Beyond component upgrades, big‑data and AI capabilities have become essential for promotion stability. Suning’s three‑layer architecture (collection, cleaning, and data‑cloud platforms) provides unified metric services, OLAP‑based business indicators, multi‑dimensional queries, and archiving, all of which feed into promotion preparation.
Cart‑Add Analysis
The shopping cart is a critical link: promotion peaks are driven by orders submitted from the cart. Cart‑add events are collected in real time from the binlog, then cleaned and fused on the data‑cloud platform. Intelligent algorithms analyze over 400 million add‑to‑cart records and 70,000 promotional activities to produce the top‑100 most‑added items and the top‑10 promotions, which are pre‑cached to reduce risk.
Higher add‑to‑cart counts indicate product popularity; targeted promotional interventions on these items improve conversion rates.
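At its simplest, producing a top-N list of most-added items is a counting problem. The following Python sketch shows that core step only; it assumes each cleaned event is a dictionary with a `sku` field, which is a hypothetical schema, and it says nothing about the algorithms Suning actually runs at the scale described.

```python
from collections import Counter


def top_added_items(events, n):
    """Return the n most frequently added SKUs as (sku, count) pairs."""
    counts = Counter(event["sku"] for event in events)
    return counts.most_common(n)
```

The resulting list could then drive pre-caching of the hottest items ahead of the promotion peak.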
Intelligent Multi‑Platform Allocation
Suning sells across its own platforms, O2O stores, offline outlets, and third‑party channels (Tmall, Douyin, Dalingjia, Meituan, Ele.me). The data‑cloud’s intelligent allocation engine processes tens of thousands of SKUs across 15 categories, dramatically reducing out‑of‑stock rates and boosting 818 sales.
Solution:
Leverage the data‑cloud and AI‑ML platform to forecast sales per platform using historical sales, promotions, and inventory depth.
The inventory center applies allocation rules based on forecasts, distributing stock to each platform.
Real‑time monitoring adjusts allocations dynamically, with anomaly alerts guiding operations.
Forecasting combines three stages: time‑series analysis for stable patterns, machine‑learning models for multi‑variable scenarios, and deep‑learning models for complex allocation patterns, maximizing prediction accuracy.
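Once per-platform forecasts exist, the allocation step needs some rule for turning them into stock quantities. The article does not state Suning's rules, so the Python sketch below assumes the simplest one: split stock in proportion to forecast demand, using the largest-remainder method so the integer shares add up exactly. The function name and the platform keys in the usage note are hypothetical.

```python
def allocate_stock(stock, forecasts):
    """Split `stock` units across platforms in proportion to forecast demand."""
    total = sum(forecasts.values())
    if total <= 0:
        return {platform: 0 for platform in forecasts}
    # Exact fractional share for each platform.
    exact = {p: stock * f / total for p, f in forecasts.items()}
    # Start from the rounded-down share.
    allocation = {p: int(share) for p, share in exact.items()}
    leftover = stock - sum(allocation.values())
    # Hand the remaining units to the largest fractional remainders.
    by_remainder = sorted(forecasts, key=lambda p: exact[p] - allocation[p],
                          reverse=True)
    for platform in by_remainder[:leftover]:
        allocation[platform] += 1
    return allocation
```

For example, `allocate_stock(100, {"own": 50, "tmall": 30, "douyin": 20})` gives each platform a share matching its forecast. Real allocation rules would likely add constraints such as minimum stock per channel, which this sketch leaves out.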
Stress Testing and Traffic Modeling
Each promotion requires two key steps: estimating traffic budgets for capacity assessment and executing stress‑test plans to verify that systems meet target thresholds.
Suning’s middle‑platform, with over 30 systems, builds seven core traffic models based on historical data and key service call chains. These models enable rapid capacity planning and are validated through multi‑round full‑link stress tests.
Solution:
Map top‑down service call relationships.
Collect TPS data from all services via big‑data pipelines to build traffic models.
Apply machine‑learning‑driven traffic forecasts to guide capacity planning, stress testing, and flow‑control configuration.
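One way to picture a traffic model is as a call graph annotated with call ratios: given a forecast TPS at the entry services, downstream TPS follows by propagation. The Python sketch below assumes the call graph is acyclic and that each edge carries an average number of downstream calls per upstream request; both are assumptions, and the service names in the usage note are made up. It uses `graphlib`, which requires Python 3.9 or later.

```python
from collections import defaultdict
from graphlib import TopologicalSorter


def propagate_tps(call_graph, entry_tps):
    """Estimate per-service TPS from entry traffic and call ratios.

    call_graph maps caller -> {callee: calls made per caller request}.
    entry_tps maps entry service -> forecast TPS.
    """
    sorter = TopologicalSorter()
    for caller, callees in call_graph.items():
        sorter.add(caller)
        for callee in callees:
            sorter.add(callee, caller)  # caller must be processed first
    for service in entry_tps:
        sorter.add(service)

    tps = defaultdict(float, entry_tps)
    # Visit callers before callees so each service's total is final
    # by the time it is pushed downstream.
    for service in sorter.static_order():
        for callee, ratio in call_graph.get(service, {}).items():
            tps[callee] += tps[service] * ratio
    return dict(tps)
```

For instance, with `{"cart": {"price": 2.0, "inventory": 1.0}, "price": {"promotion": 0.5}}` and 1000 TPS entering at `cart`, the estimate is 2000 TPS at `price` and 1000 TPS each at `inventory` and `promotion`. Numbers like these can then serve as stress-test targets and as starting points for flow-control thresholds.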
Promotion System Health Inspection
Traditional monitoring alerts only after a risk has already materialized. Suning’s data‑cloud automates health checks (cache usage, hit rates, disk I/O, TopSQL, etc.) across thousands of machines, reducing manual inspection from a week to a single day.
Solution:
Trigger inspections via TPS anomalies, flow‑control alerts, or manual initiation, with rule‑engine validation.
Integrate data from Zabbix, DBMS, and other monitoring platforms to gather health metrics.
Generate risk assessments and reports; if anomalies exist, apply AI‑based knowledge‑graph root‑cause analysis.
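The rule-validation step can be pictured as a list of named checks applied to each host's metrics. The Python sketch below is illustrative only: the metric names and thresholds are invented for the example, and it does not attempt the knowledge-graph root-cause analysis mentioned above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    metric: str
    is_healthy: Callable[[float], bool]
    message: str


# Hypothetical rules; real thresholds would come from the rule engine.
RULES = [
    Rule("cache_memory_used_pct", lambda v: v < 80, "cache memory usage is high"),
    Rule("cache_hit_rate_pct", lambda v: v >= 90, "cache hit rate is low"),
    Rule("disk_io_util_pct", lambda v: v < 70, "disk I/O utilization is high"),
]


def inspect(host, metrics, rules=RULES):
    """Return a list of findings for one host; empty means healthy."""
    findings = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            # A missing metric is itself worth reporting.
            findings.append(f"{host}: {rule.metric} missing")
        elif not rule.is_healthy(value):
            findings.append(f"{host}: {rule.message} ({rule.metric}={value})")
    return findings
```

Running such checks across every machine before the promotion, rather than waiting for alerts, is what turns monitoring into a proactive inspection.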
Future – Intelligent Diagnosis
Even senior engineers cannot pinpoint the root cause of an issue within a minute. For promotions, rapid diagnosis is critical. Suning plans a three‑layer intelligent diagnosis model: host‑level, persistence‑level (database), and business‑level analysis, leveraging cloud‑computing R&D, knowledge graphs, and AI to quickly locate the underlying problem.
Overall, promotion guarantee is a complex, systematic engineering effort. By continuously summarizing experiences, building regular preparation mechanisms, and developing layered monitoring tools, Suning moves toward increasingly systematic, refined, and intelligent assurance of stable system operation and smooth user experience.
Suning Technology
Official Suning Technology account. Explains cutting-edge retail technology and shares Suning's tech practices.
