Technical Strategies for Ensuring System Stability During the 618 Promotion
The article analyzes the importance of the 618 sales event, identifies factors that threaten system stability such as traffic spikes, massive data, complex scenarios, long delivery chains and low tolerance, and proposes comprehensive application, storage, and operational measures—including unitization, monitoring, logging, fast‑fail, rate‑limiting, degradation, database and cache designs, and emergency processes—to guarantee reliable service during the promotion.
The 618 promotion is a major sales event, accounting for roughly 10% of annual GMV. In 2022 it reached 379.3 billion CNY, an average of about 14.63 million CNY per minute, which makes system stability critical.
Challenges
Traffic volume surges to many times normal levels, turning small issues into large failures.
Massive data volume (e.g., 3.4 trillion CNY of orders in 2022) makes even simple queries expensive.
Complex promotional scenarios create high‑load processing pipelines.
Long delivery chains with many dependent services reduce overall availability.
Low user tolerance for errors demands rapid response.
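The point about long delivery chains can be made concrete with a little arithmetic. The sketch below assumes every service in the chain is a hard serial dependency and that failures are independent; the 99.9% per-service figure and the chain lengths are illustrative, not taken from the article.

```python
from math import prod

def chain_availability(availabilities):
    """Availability of a request that needs every service in a serial chain."""
    return prod(availabilities)

# Hypothetical figures: each service is 99.9% available on its own.
for hops in (1, 10, 30):
    total = chain_availability([0.999] * hops)
    print(f"{hops:>2} services -> {total:.4%} end-to-end")
```

Under these assumptions, end-to-end availability falls with every added dependency, which is why the later measures aim to isolate or degrade non-core links rather than only harden each one.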
Stability Requirements vs. Regular High‑Availability
Time‑critical: stability must be ensured within a short window, leaving little time for deep debugging.
Perspective shift: focus on overall business impact rather than individual service response.
Higher‑level metrics: stability builds on high‑availability foundations and adds operational safeguards.
Application‑Level Measures
Unitization : Deploy applications as independent units to isolate failures, simplify troubleshooting, and enable independent scaling.
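One common way to keep a unit self-contained is to pin each user to a single unit at the entry layer. The following is a minimal sketch of that idea, not the article's implementation: the unit names and the choice of user ID as the routing key are assumptions.

```python
import hashlib

UNITS = ["unit-a", "unit-b", "unit-c"]  # hypothetical unit names

def unit_for_user(user_id: str, units=UNITS) -> str:
    """Map a user to one unit with a stable hash, so that all of the
    user's requests are served inside the same unit."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(units)
    return units[index]

print(unit_for_user("user-42"))
```

Plain modulo hashing remaps many users when the number of units changes, so a real deployment would more likely use a routing table or consistent hashing.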
Monitoring & Alerting : Implement multi‑layer monitoring (middleware, RPC, method, machine, system, business, process, dashboard) with appropriate granularity, sensitivity, coverage, and accuracy; define clear alert handling procedures.
Log Management : Standardize log format, level, output, archiving, and trace‑ID; suppress duplicate or irrelevant logs to reduce resource consumption.
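As an illustration of a standardized format with a trace ID and duplicate suppression, here is a sketch built on Python's standard `logging` filters. The class names, the format string, and the 5-second window are assumptions made for the example.

```python
import logging
import time

class TraceIdFilter(logging.Filter):
    """Attach a trace ID to every record so the formatter can print it."""
    def __init__(self, get_trace_id):
        super().__init__()
        self._get_trace_id = get_trace_id

    def filter(self, record):
        record.trace_id = self._get_trace_id()
        return True

class DedupFilter(logging.Filter):
    """Drop a record if the same (level, message) was emitted within `window` seconds."""
    def __init__(self, window=5.0, clock=time.monotonic):
        super().__init__()
        self._window = window
        self._clock = clock
        self._last_seen = {}  # grows with distinct messages; bound it in real use

    def filter(self, record):
        key = (record.levelno, record.getMessage())
        now = self._clock()
        last = self._last_seen.get(key)
        if last is not None and now - last < self._window:
            return False
        self._last_seen[key] = now
        return True

def build_logger(name, get_trace_id, window=5.0):
    """Create a logger with one uniform format, a trace ID, and dedup."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s [trace=%(trace_id)s] %(name)s %(message)s"))
    handler.addFilter(TraceIdFilter(get_trace_id))
    handler.addFilter(DedupFilter(window))
    logger.addHandler(handler)
    return logger
```

Suppressing repeats at the handler keeps a failing dependency from flooding disk and I/O with identical lines during a traffic peak.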
Fast‑Fail : Configure thread‑pool timeouts, leverage middleware timeout controls, and apply rate‑limiting to fail fast and protect resources.
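A minimal sketch of fast-fail with a bounded thread pool and a caller-side timeout follows. The pool size, the helper name `call_with_timeout`, and the fallback value are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

_pool = ThreadPoolExecutor(max_workers=8)  # bounded pool; size is hypothetical

def call_with_timeout(fn, timeout_s, fallback):
    """Run fn on the pool and return fallback if no result arrives in time.
    Exceptions raised by fn itself still propagate to the caller."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # only helps if the task has not started running yet
        return fallback

def slow_dependency():
    time.sleep(0.5)  # stands in for a stalled downstream call
    return "late answer"

print(call_with_timeout(slow_dependency, 0.05, "fallback"))
```

The caller is released quickly, but the worker thread keeps running until the call returns, so the underlying client (socket, RPC, database driver) should carry its own timeout as well.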
Rate Limiting : Base limits on system capacity, use per‑service or global strategies, and prioritize critical business traffic.
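One way to combine a capacity-based limit with priority for critical traffic is a token bucket that holds back a reserve. This is a sketch under stated assumptions: the article does not name an algorithm, and the `reserve` parameter is a device invented for this example.

```python
import threading
import time

class TokenBucket:
    """Token bucket that refills `rate` tokens per second up to `capacity`."""
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()
        self._lock = threading.Lock()

    def allow(self, cost=1.0, reserve=0.0):
        """Admit the request only if `reserve` tokens would remain afterwards.
        Critical callers pass reserve=0; non-critical callers pass a positive
        reserve, which leaves headroom for critical traffic."""
        with self._lock:
            now = self._clock()
            elapsed = now - self._last
            self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
            self._last = now
            if self._tokens - cost >= reserve:
                self._tokens -= cost
                return True
            return False

bucket = TokenBucket(rate=100, capacity=100)  # hypothetical capacity from load tests
print(bucket.allow())            # critical request
print(bucket.allow(reserve=20))  # non-critical request
```

The `rate` and `capacity` values would come from measured system capacity, for example the results of the stress tests described later.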
Business Degradation : Design fallback paths that sacrifice non‑core features to preserve essential functionality during overload.
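A fallback path of this kind can be sketched as follows. The product page, the recommendations feature, and the in-memory `DEGRADED` switch are hypothetical; in practice the switch would usually live in a configuration center so it can be flipped without a release.

```python
DEGRADED = {"recommendations": False}  # hypothetical degradation switch

def render_product_page(product_id, load_core, load_recommendations):
    """Always serve core data; treat recommendations as optional."""
    page = {"product": load_core(product_id), "recommendations": []}
    if DEGRADED["recommendations"]:
        return page  # switch is on: skip the non-core call entirely
    try:
        page["recommendations"] = load_recommendations(product_id)
    except Exception:
        pass  # a non-core failure must not break the core page
    return page
```

The switch covers planned degradation under overload, while the `try` block covers unplanned failures of the non-core dependency.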
Storage‑Level Measures
Database : Use master‑slave architecture, read‑write separation, an appropriate transaction isolation level (e.g., Read Committed, RC, for write‑heavy workloads), and sharding, and optimize slow queries before the promotion starts.
Cache : Deploy one master with multiple slaves, expand capacity by adding shards, enable dual‑read, handle hot keys, and avoid large keys that degrade performance.
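One widely used way to handle hot keys is a short-lived in-process cache in front of the remote cache, so that a single popular key does not concentrate load on one shard. The sketch below assumes that approach; the class name and the 1-second TTL are illustrative, and the article may handle hot keys differently.

```python
import time

class LocalHotCache:
    """Tiny in-process cache with a short TTL, placed in front of the remote cache."""
    def __init__(self, ttl_s=1.0, clock=time.monotonic):
        self._ttl = ttl_s
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, load_remote):
        """Return the local copy while it is fresh; otherwise reload it remotely."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = load_remote(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

The trade-off is staleness of up to one TTL per application instance, which suits data such as product details better than data such as inventory counts.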
Elasticsearch : Run dual clusters for redundancy, monitor and cancel slow requests, throttle write speed, and watch storage usage against watermarks.
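For the watermark check, a small helper can classify a node's disk usage. The thresholds below are Elasticsearch's default disk watermarks (85% low, 90% high, 95% flood stage); the function name and the way disk usage is obtained are assumptions, and a cluster may override the defaults.

```python
# Elasticsearch's default disk watermarks, highest first.
WATERMARKS = [(95.0, "flood_stage"), (90.0, "high"), (85.0, "low")]

def watermark_status(used_bytes, total_bytes):
    """Return the highest watermark the node has reached, or "ok"."""
    used_pct = 100.0 * used_bytes / total_bytes
    for threshold, name in WATERMARKS:
        if used_pct >= threshold:
            return name
    return "ok"

print(watermark_status(used_bytes=870, total_bytes=1000))
```

Alerting well before the low watermark leaves time to expand storage: past it, Elasticsearch stops allocating new shards to the node, and at flood stage it puts affected indices into a read-only state.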
Operational Measures
Dedicated War‑Room Team : A cross‑functional group responsible for rapid response, impact control, and process enforcement.
Stress‑Test (War‑Game) : Simulate real traffic with coordinated upstream/downstream involvement to validate monitoring thresholds, rate‑limit values, and scaling plans.
Technical Freeze : Restrict code releases during the promotion; statistics show 70% of incidents stem from deployments.
Daily Inspections & Holiday On‑Call : Complement automated checks with manual availability verification.
Emergency Playbooks : Pre‑defined response procedures for incidents, illustrated in the accompanying diagram.
In summary, the article provides a detailed technical roadmap for preparing the 618 promotion, covering background analysis, challenge identification, and concrete measures across application, storage, and operational dimensions to ensure system stability and business continuity.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.