How NetEase Guarantees Double 11 Stability: SRE Capacity Planning and Technical Optimization

This article explains how NetEase's SRE team prepares for the massive Double 11 e‑commerce event through systematic capacity planning, data‑driven performance evaluation, coordinated technical optimizations, cross‑team activity assessment, comprehensive stability pre‑plans, and disciplined change execution to prevent system overloads.

Efficient Ops

Aug 26, 2021

Double 11 Operations: From Capacity Planning to Stability Practices

The annual Double 11 shopping festival creates massive traffic spikes, and keeping operational metrics within target through those spikes can become a nightmare for developers and operations engineers. NetEase addresses this with systematic SRE practices rather than ad‑hoc fixes.

1. Capacity Planning

Using the Double 11 event as a case study, the SRE team collaborates with product and business teams to calculate realistic QPS targets based on historical data and projected user growth. For example, the payment module normally handles 50 QPS per host but must support 2,000 QPS during the event.

Instead of the traditional "peak × 5" estimate, the team gathers comprehensive usage statistics, producing a data‑driven capacity model that avoids over‑provisioning.

The resulting performance table (Table 1) shows distinct capacity needs for each module, enabling fine‑grained resource allocation.
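The payment-module figures above can be turned into a host count with a few lines of arithmetic. The following is a minimal sketch, not the team's actual model: it assumes that 2,000 QPS is the aggregate target, that 50 QPS is a safe sustained rate for one host, and that hosts should only be loaded to a chosen percentage of that rate. The function name and the `headroom_percent` parameter are hypothetical.

```python
def hosts_needed(target_qps: int, per_host_qps: int, headroom_percent: int = 100) -> int:
    """Return the number of hosts needed to serve target_qps.

    per_host_qps     -- assumed sustainable QPS of a single host
    headroom_percent -- share of per_host_qps we are willing to use (1-100);
                        this safety margin is an assumption, not from the article
    """
    if target_qps < 0 or per_host_qps <= 0 or not 0 < headroom_percent <= 100:
        raise ValueError("invalid capacity inputs")
    # Usable capacity per host, scaled by 100 to stay in integer arithmetic.
    usable_x100 = per_host_qps * headroom_percent
    # Ceiling division: a fractional host still requires a whole host.
    return -(-target_qps * 100 // usable_x100)


if __name__ == "__main__":
    print(hosts_needed(2000, 50))      # hosts at full per-host load
    print(hosts_needed(2000, 50, 70))  # hosts when each is kept at 70% load
```

Keeping the headroom as an explicit parameter makes the trade-off between over-provisioning and safety margin visible for each module instead of hiding it in a blanket multiplier.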

2. Technical Optimization

When performance testing uncovers issues across multiple services, the SRE team evaluates impact, difficulty, and coordination cost to propose solutions. Service A may apply rate‑limiting, Service B can cache responses, and Service C might refine locking mechanisms.

The solution matrix (Table 2) lists these alternatives, allowing project managers to balance implementation effort against business priorities.
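Of the options named above, rate-limiting is the easiest to illustrate in isolation. One common way to implement it is a token bucket. The sketch below is a generic single-process version and is not taken from the article; the class name, parameters, and the injectable `clock` argument are illustrative assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self._clock = clock         # injectable so the limiter can be tested
        self._tokens = capacity     # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self._clock()
        # Refill according to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    limiter = TokenBucket(rate=2.0, capacity=2.0)
    # Five back-to-back requests: the burst capacity admits the first ones,
    # later ones are rejected until tokens refill.
    print([limiter.allow() for _ in range(5)])
```

A production limiter would also need thread safety and, for a multi-host service, shared state; both are left out here to keep the idea visible.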

3. Activity Evaluation

SRE works with operations to model traffic flow, estimate peak users, and align capacity. For a flash‑sale expected to attract 500,000 users in 30 seconds, a 50% click‑through within 5 seconds yields an estimated 5,000 concurrent requests at the peak second.

The evaluation model (Figure 3) visualizes the collaboration between SRE and operations, ensuring realistic traffic assumptions.
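The flash-sale estimate can be reproduced as a small formula, which makes the hidden assumptions explicit. In the sketch below, spreading clicks evenly over the window gives 500,000 × 0.5 / 5 = 50,000 requests per second. Arriving at a figure of about 5,000 requires a further reduction, for example only about 10% of clicks reaching the backend after caching or front-end filtering. That `backend_fraction` parameter is an assumption introduced here for illustration and is not stated in the article.

```python
def peak_requests_per_second(users: int, click_fraction: float,
                             window_seconds: float,
                             backend_fraction: float = 1.0) -> float:
    """Estimate per-second backend load, assuming clicks spread evenly.

    users            -- people expected to join the activity
    click_fraction   -- share of users who click within the window
    window_seconds   -- length of the click window
    backend_fraction -- share of clicks that reach the backend
                        (hypothetical parameter, not from the article)
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return users * click_fraction * backend_fraction / window_seconds


if __name__ == "__main__":
    # Every click reaches the backend.
    print(peak_requests_per_second(500_000, 0.5, 5))
    # Assumed: only one click in ten reaches the backend.
    print(peak_requests_per_second(500_000, 0.5, 5, backend_fraction=0.1))
```

Real click traffic is rarely uniform, so the true peak second is likely higher than an even spread suggests; the estimate is best treated as a lower bound to be checked against load tests.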

4. Stability Pre‑Plan

Cross‑team reviews of overload mitigation plans are conducted before the event. The overload diagram (Figure 4) outlines rate‑limit and scaling priorities: rate‑limit order A > B > C and scaling order C > B > A.
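The two priority orders can be written down as data so that the pre-plan is reviewable and executable. The sketch below is one possible encoding, not the team's tooling. It assumes that rate limiting is tried before scaling because it takes effect faster, and the per-service figures passed in are hypothetical.

```python
from typing import Dict, List, Tuple

RATE_LIMIT_ORDER = ["A", "B", "C"]  # from the pre-plan: limit A first
SCALE_ORDER = ["C", "B", "A"]       # from the pre-plan: scale C first

Action = Tuple[str, str, int]       # (action, service, qps handled)


def plan_overload_actions(excess_qps: int,
                          sheddable_qps: Dict[str, int],
                          scalable_qps: Dict[str, int]) -> Tuple[List[Action], int]:
    """Return ordered actions that absorb excess_qps, plus any uncovered load.

    sheddable_qps -- QPS each service may reject via rate limiting (assumed input)
    scalable_qps  -- extra QPS each service gains by scaling out (assumed input)
    """
    actions: List[Action] = []
    remaining = excess_qps
    # Assumption: rate limiting comes first, scaling second.
    for action, order, budget in (("rate_limit", RATE_LIMIT_ORDER, sheddable_qps),
                                  ("scale_out", SCALE_ORDER, scalable_qps)):
        for service in order:
            if remaining <= 0:
                return actions, 0
            handled = min(remaining, budget.get(service, 0))
            if handled > 0:
                actions.append((action, service, handled))
                remaining -= handled
    return actions, max(remaining, 0)


if __name__ == "__main__":
    plan, uncovered = plan_overload_actions(
        500, {"A": 200, "B": 100, "C": 50}, {"C": 100, "B": 300, "A": 300})
    for step in plan:
        print(step)
    print("uncovered QPS:", uncovered)
```

A non-zero uncovered value is the useful output of a review: it shows before the event that the agreed mitigations cannot absorb the modeled overload.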

5. Change Execution Requirements

Large‑scale activities demand precise change execution and ordering. A sample change schedule (Table 3) shows coordinated steps across subsystems, with intentional intervals to avoid interference.
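A schedule with enforced intervals can be generated rather than written by hand, which removes one source of ordering mistakes. The sketch below is illustrative only: the step names, durations, start time, and gap are hypothetical and do not come from the article's schedule.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Step = Tuple[str, timedelta]                  # (change name, expected duration)
Slot = Tuple[str, datetime, datetime]         # (change name, start, end)


def build_change_schedule(start: datetime, steps: List[Step],
                          gap: timedelta) -> List[Slot]:
    """Lay out changes strictly in order, leaving `gap` between consecutive steps."""
    schedule: List[Slot] = []
    cursor = start
    for name, duration in steps:
        end = cursor + duration
        schedule.append((name, cursor, end))
        # The next change may not begin until the observation gap has passed.
        cursor = end + gap
    return schedule


if __name__ == "__main__":
    # Hypothetical steps and times, for illustration only.
    steps = [
        ("gateway rate-limit config", timedelta(minutes=10)),
        ("cache warm-up", timedelta(minutes=20)),
        ("payment scale-out", timedelta(minutes=15)),
    ]
    for name, begin, end in build_change_schedule(
            datetime(2021, 11, 10, 20, 0), steps, gap=timedelta(minutes=5)):
        print(f"{begin:%H:%M}-{end:%H:%M}  {name}")
```

The gap gives each change time to show its effect in monitoring before the next one starts, so a regression can be attributed to a single step.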

Adhering to these disciplined processes significantly improves stability and success rates for high‑traffic promotions.
