Engineering Double‑11‑Scale E‑Commerce Events: A Complete Technical Playbook

From kickoff meetings and traffic forecasting to load‑testing strategies, rate‑limiting designs, emergency runbooks, and post‑event retrospectives, this guide walks engineers through the complete technical workflow required to ensure a Double‑11‑scale e‑commerce promotion runs smoothly and safely.

JavaEdge

Sep 5, 2022

Pre‑event preparation

Kick‑off (KO) meeting: collect promotion background, objectives, activity schedule, participating products, third‑party involvement, expected DAU/DAT, and inventory.

Business‑technical alignment: confirm activity granularity, promotional intensity, and the basis for DAU/DAT estimates with product owners.

Traffic evaluation: measure current daily traffic, DAU, DAT; compare business forecasts with technical capacity; decide whether to optimise, scale out, or plan degradation.

Business mapping: classify services as core, non‑core, or degradable; incorporate new projects since the last promotion; assess impact on critical paths and any performance‑test results.

Dependency mapping: document upstream and downstream service dependencies to obtain a full‑stack view.
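As one way to picture the business and dependency mapping steps, the sketch below walks a dependency graph to list every service that a core flow ultimately relies on. The service names and edges are hypothetical, and in practice the graph would come from tracing data or a service registry rather than a hand-written map.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: service names and edges are hypothetical.
class DependencyMap {
    // Returns the core services plus everything reachable from them through downstream calls.
    static Set<String> criticalPath(Map<String, List<String>> downstream, List<String> coreServices) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(coreServices);
        while (!queue.isEmpty()) {
            String service = queue.poll();
            if (!visited.add(service)) {
                continue; // already seen; this also guards against dependency cycles
            }
            queue.addAll(downstream.getOrDefault(service, List.of()));
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> downstream = Map.of(
            "checkout", List.of("inventory", "payment"),
            "payment", List.of("risk-control"),
            "recommendations", List.of("reviews"));
        // Services missing from the result are candidates for the non-core or degradable class.
        System.out.println(criticalPath(downstream, List.of("checkout")));
    }
}
```

Anything reachable from a core service needs the same capacity and protection as the core service itself; anything outside that set is a candidate for degradation.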

Load testing

Two approaches are used:

Test without scaling. This is generally discouraged because the business targets must be artificially scaled down to fit current capacity.

Test after scaling the system to the target promotion capacity.
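For the second approach, the target capacity has to be turned into an instance count first. The following is a minimal sketch of that arithmetic, assuming a measured per-instance throughput and a chosen target utilisation; both inputs and all figures are hypothetical, not values from this playbook.

```java
// Sketch only: all figures are hypothetical.
class CapacityPlan {
    // Instances needed so the forecast peak fits within a target utilisation of measured per-instance capacity.
    static int instancesNeeded(double forecastPeakQps, double perInstanceQps, double targetUtilisation) {
        if (perInstanceQps <= 0 || targetUtilisation <= 0 || targetUtilisation > 1) {
            throw new IllegalArgumentException("capacity must be positive and utilisation in (0, 1]");
        }
        double usablePerInstance = perInstanceQps * targetUtilisation;
        return (int) Math.ceil(forecastPeakQps / usablePerInstance);
    }

    public static void main(String[] args) {
        int current = 8;
        int needed = instancesNeeded(12_000, 1_000, 0.5);
        System.out.printf("current=%d needed=%d scaleOutBy=%d%n",
            current, needed, Math.max(0, needed - current));
    }
}
```

Keeping utilisation below 1 leaves headroom for forecast error and uneven traffic distribution.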

Testing must be performed in the production environment; staging or pre‑release clusters give misleading results due to hardware and middleware differences. Load is increased gradually, with system health checked at each step, and the test stops once the target QPS/TPS is reached. Sustained peak load is avoided because it causes CPU spikes, frequent GC, and a degraded experience for real users.
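The gradual ramp can be sketched as a list of load steps plus a loop that stops at the first unhealthy step. In this illustration the `healthyAt` predicate is a hypothetical stand-in for "apply this load and check the dashboards"; a real run would call a load-testing tool and a monitoring system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Sketch only: the health check is a hypothetical stand-in for a real load tool and monitoring check.
class RampPlan {
    // Builds increasing QPS steps that end exactly at the target.
    static List<Integer> steps(int startQps, int targetQps, int stepQps) {
        if (startQps <= 0 || stepQps <= 0 || targetQps < startQps) {
            throw new IllegalArgumentException("need 0 < startQps <= targetQps and stepQps > 0");
        }
        List<Integer> plan = new ArrayList<>();
        for (int qps = startQps; qps < targetQps; qps += stepQps) {
            plan.add(qps);
        }
        plan.add(targetQps);
        return plan;
    }

    // Walks the plan and returns the highest step that passed, or 0 if the first step failed.
    static int run(List<Integer> plan, IntPredicate healthyAt) {
        int lastHealthy = 0;
        for (int qps : plan) {
            if (!healthyAt.test(qps)) {
                break; // stop immediately rather than holding an unhealthy load level
            }
            lastHealthy = qps;
        }
        return lastHealthy;
    }
}
```

The value returned by `run` is the figure that would feed the load-test report: the highest load level the system handled while still healthy.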

Rate limiting

After load testing, a report quantifies the maximum QPS/TPS each service can sustain (e.g., product‑detail page, order page). Both per‑instance (single‑machine) and cluster‑wide limits are set:

// Example calculation
// Service B has 3 instances, downstream C can handle 100 QPS
// Cluster limit for B = 100 QPS
// Per‑instance limit ≈ 33 QPS (adjusted for traffic skew)

Per‑instance limits protect the service itself; cluster limits protect downstream services.
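The calculation above, together with a per-instance limiter, might look like the sketch below. It assumes an even split rounded down and uses a token bucket as the limiting algorithm; neither choice is prescribed by this playbook, and the class names are illustrative. A true cluster-wide limit would need shared state (for example a central counter), which is not shown.

```java
import java.util.function.LongSupplier;

// Sketch only: even split and token bucket are assumptions, not the article's prescribed method.
class RateLimits {
    // Even split, rounded down so the per-instance limits never sum to more than the cluster limit.
    static int perInstanceLimit(int clusterLimitQps, int instances) {
        if (clusterLimitQps < 0 || instances <= 0) {
            throw new IllegalArgumentException("limit must be >= 0 and instances > 0");
        }
        return clusterLimitQps / instances;
    }
}

// Minimal single-instance token bucket; the nanosecond clock is injected so it can be tested.
class TokenBucket {
    private final double ratePerSecond;
    private final double capacity;
    private final LongSupplier nanoClock;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double ratePerSecond, double capacity, LongSupplier nanoClock) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.nanoClock = nanoClock;
        this.tokens = capacity;
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    // Refills according to elapsed time, then takes one token if available.
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * ratePerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false; // caller should reject or queue the request
    }
}
```

For service B in the example, each instance could hold `new TokenBucket(RateLimits.perInstanceLimit(100, 3), 33, System::nanoTime)` and reject requests when `tryAcquire()` returns false.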

Degradation strategy

Rate limiting safeguards stability, but the primary goal is to keep core flows (product‑detail, checkout) functional. During peak load, non‑essential or high‑cost features are degraded to free resources for core transactions. The plan specifies which services to degrade, when to degrade them, who owns each decision, and how each service is restored.
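A degradation plan usually ends up as a set of switches in code. The sketch below shows one possible shape, assuming an in-memory switch set and a cheap fallback per feature; the class and the feature name are hypothetical, and a production system would typically back the switches with a configuration centre so they can be flipped without a release.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch only: an in-memory switch set with hypothetical feature names.
class DegradationSwitches {
    private final Set<String> degraded = ConcurrentHashMap.newKeySet();

    void degrade(String feature) {
        degraded.add(feature);
    }

    void restore(String feature) {
        degraded.remove(feature);
    }

    boolean isDegraded(String feature) {
        return degraded.contains(feature);
    }

    // Runs the full feature unless it is switched off, in which case the cheap fallback is used.
    <T> T call(String feature, Supplier<T> full, Supplier<T> fallback) {
        return isDegraded(feature) ? fallback.get() : full.get();
    }
}
```

A caller might wrap a non-core feature as `switches.call("recommendations", this::personalised, this::staticList)`, so that flipping the switch sheds the expensive path while the page still renders.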

Emergency runbooks and drills

Runbooks cover failures such as product‑detail errors, order rendering failures, checkout failures, and fulfilment issues. Internal systems avoid external dependencies; where third‑party services are required, traffic shaping, flood control, and SLA agreements are put in place. Runbooks are rehearsed through automated platforms or manual drills.
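One simple form of flood control for a third-party dependency is a cap on concurrent calls, with a fallback when the cap is reached or the call fails. The sketch below illustrates that idea with a semaphore; it is an assumption about how such a guard could look, not a description of any specific product, and the class name is hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch only: a concurrency cap as one possible flood-control mechanism.
class ThirdPartyGuard {
    private final Semaphore permits;

    ThirdPartyGuard(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Calls the third party only if a permit is free; otherwise, or on failure, uses the fallback.
    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // shed load instead of queueing on the third party
        }
        try {
            return remoteCall.get();
        } catch (RuntimeException e) {
            return fallback.get();
        } finally {
            permits.release();
        }
    }
}
```

The cap would normally be set from the capacity agreed in the SLA, so that the promotion cannot push more concurrent traffic to the partner than it has committed to serve.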

Monitoring

Real‑time monitoring spans the whole stack: promotion dashboard, full‑link tracing, core‑service health, load‑test metrics, rate‑limit status, resource utilisation, gateway performance, and channel fulfilment.

Operational standards

Strict change‑management rules define release, data‑migration, emergency scaling, and hot‑fix procedures. These “promotion change standards” provide clear escalation paths and accountability.

Additional logistics

Weekly pre‑promotion meetings

On‑call scheduling and war‑room setup

Issue escalation procedures

During the promotion

On‑call engineers record timestamps, log incidents, and continuously monitor alerts. Any anomaly triggers the predefined runbook and escalation process. All releases and data changes are frozen for the duration of the promotion to maintain stability.
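The freeze is easier to enforce when the release pipeline checks it automatically. The following is a minimal sketch of such a check, assuming the freeze is a single time window; the class name and the dates in the usage note are hypothetical.

```java
import java.time.Instant;

// Sketch only: a single freeze window with an inclusive start and exclusive end.
class ChangeFreeze {
    private final Instant start;
    private final Instant end;

    ChangeFreeze(Instant start, Instant end) {
        if (!end.isAfter(start)) {
            throw new IllegalArgumentException("end must be after start");
        }
        this.start = start;
        this.end = end;
    }

    // True when a release or data change requested at this instant must be rejected.
    boolean blocks(Instant requestedAt) {
        return !requestedAt.isBefore(start) && requestedAt.isBefore(end);
    }
}
```

A pipeline step could construct `new ChangeFreeze(Instant.parse("2022-11-10T12:00:00Z"), Instant.parse("2022-11-12T12:00:00Z"))` and fail the deployment when `blocks(Instant.now())` returns true, leaving emergency hot-fixes to the separately defined escalation path.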

Post‑event review

A retrospective evaluates technical, product, and business outcomes. Successful tactics are codified; shortcomings drive product‑model adjustments, performance optimisations, and architectural changes. Degraded services are restored, and a continuous improvement loop (optimise → load‑test → repeat) is established.
