Engineering Double‑11‑Scale E‑Commerce Events: A Complete Technical Playbook
From kickoff meetings and traffic forecasting to load‑testing strategies, rate‑limiting designs, emergency runbooks, and post‑event retrospectives, this guide walks engineers through the complete technical workflow required to ensure a Double‑11‑scale e‑commerce promotion runs smoothly and safely.
Pre‑event preparation
Kick‑off (KO) meeting: collect promotion background, objectives, activity schedule, participating products, third‑party involvement, expected DAU/DAT, and inventory.
Business‑technical alignment: confirm activity granularity, promotional intensity, and the basis for DAU/DAT estimates with product owners.
Traffic evaluation: measure current daily traffic, DAU, DAT; compare business forecasts with technical capacity; decide whether to optimise, scale out, or plan degradation.
Business mapping: classify services as core, non‑core, or degradable; incorporate new projects since the last promotion; assess impact on critical paths and any performance‑test results.
Dependency mapping: document upstream and downstream service dependencies to obtain a full‑stack view.
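The dependency-mapping step can be sketched as a small graph walk. This is a minimal illustration, not a prescribed tool: the service names and the `DEPENDENCIES` map are hypothetical, and a real map would more likely come from tracing data or a service registry.

```python
from collections import deque

# Hypothetical dependency map: service -> services it calls directly.
DEPENDENCIES = {
    "checkout": ["order", "payment"],
    "order": ["inventory", "pricing"],
    "payment": ["third-party-gateway"],
    "product-detail": ["pricing", "recommendation"],
    "pricing": [],
    "inventory": [],
    "recommendation": [],
    "third-party-gateway": [],
}


def downstream_closure(service, dependencies):
    """Return every service reachable from `service`, directly or transitively."""
    seen = set()
    queue = deque(dependencies.get(service, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        queue.extend(dependencies.get(current, []))
    return seen


def upstream_callers(target, dependencies):
    """Return every service whose downstream closure contains `target`."""
    return {
        service
        for service in dependencies
        if target in downstream_closure(service, dependencies)
    }
```

Calling `downstream_closure("checkout", DEPENDENCIES)` lists everything the checkout flow relies on, while `upstream_callers("pricing", DEPENDENCIES)` shows which flows are affected if pricing is degraded.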
Load testing
Two approaches are used:
Test without scaling (generally discouraged, because the business targets must then be artificially lowered to fit current capacity).
Test after scaling the system to the target promotion capacity.
Testing must be performed in the production environment; staging or pre‑release clusters give misleading results because their hardware and middleware differ. Load is increased gradually, with system health checked at each step, and the test stops once the target QPS/TPS is reached. Sustained peak load is avoided because it risks CPU spikes, frequent GC, and a degraded experience for real users.
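The gradual ramp described above might look like the following sketch. The function names, the step size, and the `apply_load` / `is_healthy` callbacks are assumptions made for illustration; in practice they would wrap whatever load generator and health checks the team already uses.

```python
def ramp_steps(start_qps, target_qps, step_qps):
    """Yield increasing load levels, always ending exactly at target_qps."""
    if start_qps <= 0 or step_qps <= 0 or target_qps < start_qps:
        raise ValueError("expected 0 < start_qps <= target_qps and step_qps > 0")
    level = start_qps
    while level < target_qps:
        yield level
        level += step_qps
    yield target_qps  # final step is capped at the target, never above it


def run_ramp(steps, apply_load, is_healthy):
    """Apply each load level in order and stop at the first unhealthy one.

    Returns (last_healthy_level, reached_target).
    """
    last_healthy = 0
    for level in steps:
        apply_load(level)
        if not is_healthy(level):
            return last_healthy, False  # abort instead of pushing further
        last_healthy = level
    return last_healthy, True
```

For example, `run_ramp(ramp_steps(1000, 5000, 1500), apply_load, is_healthy)` steps through 1000, 2500, 4000, and 5000 QPS and returns as soon as a health check fails, which keeps the time spent at high load short.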
Rate limiting
After load testing, a report quantifies the maximum QPS/TPS each service can sustain (e.g., product‑detail page, order page). Both per‑instance (single‑machine) and cluster‑wide limits are set:
// Example calculation
// Service B has 3 instances, downstream C can handle 100 QPS
// Cluster limit for B = 100 QPS
// Per‑instance limit ≈ 33 QPS (adjusted for traffic skew)
Per‑instance limits protect the service itself; cluster limits protect downstream services.
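The example calculation can be expressed in a few lines, together with one common way to enforce a per-instance limit. This is a sketch under stated assumptions: `per_instance_limit` and the optional `instance_capacity_qps` cap are hypothetical helpers, and the token bucket is a general-purpose technique chosen here for illustration, not something the source prescribes. Any adjustment for traffic skew is left as manual tuning on top of the even share.

```python
def per_instance_limit(cluster_limit_qps, instances, instance_capacity_qps=None):
    """Even share of the cluster limit, optionally capped by what one
    instance sustained in load testing (hypothetical parameter)."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    share = cluster_limit_qps // instances
    if instance_capacity_qps is not None:
        share = min(share, instance_capacity_qps)
    return share


class TokenBucket:
    """Token-bucket limiter; the caller passes the clock so it is easy to test."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now

    def allow(self, now):
        """Return True if one request may pass at time `now` (in seconds)."""
        if now > self.last:
            refill = (now - self.last) * self.rate
            self.tokens = min(self.capacity, self.tokens + refill)
            self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Service B: 3 instances, downstream C sustains 100 QPS.
limit = per_instance_limit(cluster_limit_qps=100, instances=3)  # 33
bucket = TokenBucket(rate_per_sec=limit, capacity=limit)
```

In production the `now` argument would typically come from `time.monotonic()`. The cluster-wide limit would need shared state (for example a central counter), which this sketch does not cover.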
Degradation strategy
Rate limiting safeguards stability, but the primary goal is to keep core flows (product‑detail, checkout) functional. Non‑essential or high‑cost features are degraded during peak load, freeing resources for core transactions. The plan specifies which services to downgrade, timing, owners, and recovery procedures.
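A degradation plan of this kind can be kept as data and evaluated against current load. The sketch below is illustrative only: the feature names, owners, and thresholds are hypothetical, and load is assumed to be expressed as a fraction of the load-tested capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DegradableFeature:
    name: str
    owner: str                 # who switches it off and back on
    degrade_above_load: float  # fraction of load-tested capacity, e.g. 0.8


# Hypothetical plan; core flows (product-detail, checkout) are never listed.
PLAN = [
    DegradableFeature("recommendations", "team-growth", 0.70),
    DegradableFeature("reviews", "team-content", 0.80),
    DegradableFeature("browsing-history", "team-profile", 0.90),
]


def features_to_degrade(plan, current_load_ratio):
    """Names of features that should be switched off at this load level."""
    return [f.name for f in plan if current_load_ratio >= f.degrade_above_load]


def features_to_restore(plan, current_load_ratio):
    """Names of features that may be switched back on at this load level."""
    return [f.name for f in plan if current_load_ratio < f.degrade_above_load]
```

Keeping the owner next to each entry makes it clear who acts when a threshold is crossed, and the same list drives recovery after the peak. A real plan would probably add a margin between the degrade and restore thresholds to avoid flapping.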
Emergency runbooks and drills
Runbooks cover failures such as product‑detail errors, order rendering failures, checkout failures, and fulfillment issues. For internal systems, external dependencies are avoided; if third‑party services are required, traffic‑shaping, flood‑control, and SLA agreements are established. Runbooks are rehearsed via automated platforms or manual drills.
Monitoring
Real‑time monitoring spans the whole stack: promotion dashboard, full‑link tracing, core‑service health, load‑test metrics, rate‑limit status, resource utilisation, gateway performance, and channel fulfilment.
Operational standards
Strict change‑management rules define release, data‑migration, emergency scaling, and hot‑fix procedures. These “promotion change standards” provide clear escalation paths and accountability.
Additional logistics
Weekly pre‑promotion meetings
On‑call scheduling and war‑room setup
Issue escalation procedures
During the promotion
On‑call engineers record timestamps, log incidents, and continuously monitor alerts. Any anomaly triggers the predefined runbook and escalation process. All releases and data changes are frozen for the duration of the promotion to maintain stability.
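The change freeze can be enforced mechanically, for instance as a gate in the release pipeline. The following is a minimal sketch; the freeze window dates, the `check_change` name, and the change-type labels are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical freeze window for the promotion (timezone-aware, UTC).
FREEZE_START = datetime(2024, 11, 10, 20, 0, tzinfo=timezone.utc)
FREEZE_END = datetime(2024, 11, 12, 2, 0, tzinfo=timezone.utc)


def in_freeze(now, start=FREEZE_START, end=FREEZE_END):
    """True while `now` falls inside the freeze window [start, end)."""
    return start <= now < end


def check_change(change_type, now):
    """Raise if a release or data change is attempted during the freeze."""
    if in_freeze(now):
        raise PermissionError(
            f"{change_type} blocked: change freeze until {FREEZE_END.isoformat()}"
        )
    return True
```

A pipeline step could call `check_change("release", datetime.now(timezone.utc))` before deploying, so that a frozen change fails early with an explicit message.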
Post‑event review
A retrospective evaluates technical, product, and business outcomes. Successful tactics are codified; shortcomings drive adjustments to product models, performance optimisations, and architectural changes. Degraded services are restored, and a continuous improvement loop (optimise → load‑test → repeat) is established.
JavaEdge
First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.
