How NetEase Guarantees Double 11 Stability: SRE Capacity Planning and Technical Optimization
This article explains how NetEase's SRE team prepares for the massive Double 11 e‑commerce event through systematic capacity planning, data‑driven performance evaluation, coordinated technical optimizations, cross‑team activity assessment, comprehensive stability pre‑plans, and disciplined change execution to prevent system overloads.
Double 11 Operations: From Capacity Planning to Stability Practices
The annual Double 11 shopping festival creates massive traffic spikes that put heavy pressure on operational metrics, a recurring nightmare for development and operations teams. NetEase addresses this with systematic SRE practices rather than ad‑hoc fixes.
1. Capacity Planning
For Double 11, the SRE team works with product and business teams to derive realistic QPS targets from historical data and projected user growth. For example, the payment module normally handles 50 QPS per host but must support 2,000 QPS during the event.
Instead of the traditional "peak × 5" estimate, the team gathers comprehensive usage statistics, producing a data‑driven capacity model that avoids over‑provisioning.
The resulting performance table (Table 1) shows distinct capacity needs for each module, enabling fine‑grained resource allocation.
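The sizing step behind such a table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats the 50 QPS figure as the sustainable throughput of one payment host and the 2,000 QPS figure as the aggregate event target. The function name hosts_needed and the max_utilization headroom parameter are hypothetical and do not come from the article.

```python
import math


def hosts_needed(target_qps: float, per_host_qps: float,
                 max_utilization: float = 0.7) -> int:
    """Hosts required so each host stays at or below max_utilization
    of its measured per-host throughput (hypothetical helper)."""
    if target_qps < 0 or per_host_qps <= 0:
        raise ValueError("target_qps must be >= 0 and per_host_qps > 0")
    if not 0 < max_utilization <= 1:
        raise ValueError("max_utilization must be in (0, 1]")
    # Usable throughput per host after reserving headroom.
    usable_per_host = per_host_qps * max_utilization
    return math.ceil(target_qps / usable_per_host)


if __name__ == "__main__":
    # Payment module figures from the text; the headroom values are assumptions.
    print(hosts_needed(2000, 50, max_utilization=1.0))
    print(hosts_needed(2000, 50, max_utilization=0.7))
```

Running a module at 100% of measured capacity leaves no margin for error, which is why the sketch exposes the headroom as a parameter instead of hard-coding it.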
2. Technical Optimization
When performance testing uncovers issues across multiple services, the SRE team weighs the impact, difficulty, and coordination cost of each fix before proposing solutions. For example, Service A may apply rate‑limiting, Service B can cache responses, and Service C might refine its locking mechanisms.
The solution matrix (Table 2) lists these alternatives, allowing project managers to balance implementation effort against business priorities.
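As an illustration of the rate‑limiting option, here is a minimal token‑bucket sketch in Python. The article does not say which algorithm Service A uses, so the token bucket is an assumption chosen because it is a common approach. The class name, its parameters, and the injectable clock are all hypothetical.

```python
import time


class TokenBucket:
    """Minimal single-threaded token bucket (illustrative only)."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._clock = clock         # injectable so the bucket can be tested
        self._tokens = capacity     # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request may proceed."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=100, capacity=100)
    accepted = sum(bucket.allow() for _ in range(150))
    print(f"accepted {accepted} of 150 back-to-back requests")
```

The sketch is not thread-safe. A service handling concurrent requests would need a lock around allow() or a limiter provided by its gateway or framework.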
3. Activity Evaluation
SRE works with operations to model traffic flow, estimate peak users, and align capacity. For a flash sale expected to attract 500k users in 30 seconds, if 50% of them click within 5 seconds, that is 250k requests in 5 seconds, or roughly 50k requests per second at peak, assuming the clicks are spread evenly.
The evaluation model (Figure 3) visualizes the collaboration between SRE and operations, ensuring realistic traffic assumptions.
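The peak estimate is simple arithmetic, sketched below in Python. The function and its parameter names are hypothetical. It assumes clicks are spread evenly across the window and that each user sends one request, which the article does not state. Under those assumptions, 500,000 users with a 50% click‑through in 5 seconds gives 50,000 requests per second. A real estimate may apply further factors.

```python
def peak_requests_per_second(users: int, active_fraction: float,
                             window_seconds: float,
                             requests_per_user: float = 1.0) -> float:
    """Average request rate if active users click evenly within the window.

    requests_per_user is an assumed multiplier, not a figure from the article.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= active_fraction <= 1:
        raise ValueError("active_fraction must be in [0, 1]")
    return users * active_fraction * requests_per_user / window_seconds


if __name__ == "__main__":
    # Flash-sale figures from the text: 500k users, 50% click within 5 s.
    print(peak_requests_per_second(500_000, 0.5, 5))
```

Because real clicks cluster at the opening second, the even‑spread figure is better read as a floor for the true peak than as a ceiling.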
4. Stability Pre‑Plan
Cross‑team reviews of overload mitigation plans are conducted before the event. The overload diagram (Figure 4) outlines rate‑limit and scaling priorities: rate‑limit order A > B > C and scaling order C > B > A.
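A pre‑plan of this kind can be kept as data so that it is easy to review. The Python sketch below assumes that ">" means "is acted on before", so A is rate‑limited first and C is scaled first. The function name build_runbook and the shape of its result are hypothetical.

```python
# Priority orders taken from the text; ">" is read as "acted on first".
RATE_LIMIT_ORDER = ["A", "B", "C"]
SCALE_ORDER = ["C", "B", "A"]


def build_runbook(affected):
    """Order the affected services for each mitigation action.

    Returns a dict with the services to rate-limit and to scale,
    each listed in the agreed priority order.
    """
    affected = set(affected)
    unknown = affected - set(RATE_LIMIT_ORDER)
    if unknown:
        raise ValueError(f"no pre-plan for services: {sorted(unknown)}")
    return {
        "rate_limit": [s for s in RATE_LIMIT_ORDER if s in affected],
        "scale": [s for s in SCALE_ORDER if s in affected],
    }


if __name__ == "__main__":
    print(build_runbook({"A", "C"}))
```

Rejecting unknown services outright is a deliberate choice in this sketch. A service without a reviewed plan should surface before the event, not during it.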
5. Change Execution Requirements
Large‑scale activities demand precise change execution and ordering. A sample change schedule (Table 3) shows coordinated steps across subsystems, with intentional intervals to avoid interference.
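One way to make the ordering and the intervals explicit is to generate the schedule from a list of steps. The Python sketch below is illustrative only: the subsystem names, durations, start time, and the fixed gap between steps are all hypothetical and are not taken from Table 3.

```python
from datetime import datetime, timedelta


def build_schedule(start, steps, gap):
    """Lay out changes strictly one after another with a fixed gap.

    steps is a list of (subsystem, change, duration) tuples.
    Returns a list of (begin, end, subsystem, change) tuples.
    """
    schedule = []
    cursor = start
    for subsystem, change, duration in steps:
        end = cursor + duration
        schedule.append((cursor, end, subsystem, change))
        # Leave an interval so the effects of two changes do not overlap.
        cursor = end + gap
    return schedule


if __name__ == "__main__":
    # Hypothetical steps and times, for illustration only.
    steps = [
        ("gateway", "raise rate-limit thresholds", timedelta(minutes=10)),
        ("payment", "scale out hosts", timedelta(minutes=20)),
        ("cache", "pre-warm hot keys", timedelta(minutes=15)),
    ]
    start = datetime(2024, 11, 10, 22, 0)
    for begin, end, subsystem, change in build_schedule(
            start, steps, gap=timedelta(minutes=5)):
        print(f"{begin:%H:%M}-{end:%H:%M}  {subsystem}: {change}")
```

The gap gives operators time to confirm that one change has settled before the next begins, so a regression can be attributed to a single step.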
Adhering to these disciplined processes significantly improves stability and success rates for high‑traffic promotions.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.