WeChat’s 900M MAU Scaling: Secrets of Efficient Operations
The talk outlines how WeChat handles rapid user growth through disciplined operational standards, cloud-based management, precise capacity planning, and automated scaling. It covers configuration file conventions, name-service design, hardware metric evaluation, stress-testing methods, and dynamic resource allocation, all aimed at keeping efficiency high and cost low.
1. Operational Standards
When business volume grows quickly, efficiency is the primary concern; later, cost becomes the focus. The operational standards are divided into four parts: operational norms, cloud‑based management, capacity management, and automatic scheduling.
1.1 Configuration File Standards
Configuration files are standardized in directory structure, cross‑service shared items, per‑instance differences, and environment differences (dev/test/production). The goal is that the MD5 of configuration files remains identical across all environments, enabling seamless deployment without manual scripts.
All instances of the same service version have identical configuration file MD5 across environments.
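The MD5 rule above can be checked mechanically. The following is a minimal sketch, assuming the configuration files for each environment are available as bytes; the function names and the sample config bodies are hypothetical and do not come from the talk.

```python
import hashlib


def config_md5(content: bytes) -> str:
    """Return the hex MD5 digest of a configuration file's bytes."""
    return hashlib.md5(content).hexdigest()


def environments_match(configs: dict[str, bytes]) -> bool:
    """True when every environment carries a byte-identical config file."""
    digests = {config_md5(body) for body in configs.values()}
    return len(digests) <= 1


# Hypothetical config bodies keyed by environment name.
sample = {
    "dev": b"listen_port=8080\nworkers=4\n",
    "test": b"listen_port=8080\nworkers=4\n",
    "production": b"listen_port=8080\nworkers=4\n",
}
```

A deployment pipeline could run a check like `environments_match(sample)` before release and refuse to ship when any environment's digest differs.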
1.2 Name Service Standards
The name service is organized into three layers: access layer (LVS‑like), logic layer (etcd‑like), and storage layer (automated routing). Service scaling is treated as an operations task independent of development releases.
1.3 Data Storage Standards
Access layer: no data.
Logic layer: short‑lived cache, static data only, no dynamic data persistence.
Storage layer: long‑lived cache, Paxos‑based data persistence.
Scaling the access and logic layers does not require data migration or cache‑hit considerations.
2. Cloud‑Based Management
We migrated the logic layer to a private cloud because of the massive number of micro‑services (≈5,000) and resource contention on physical machines.
2.1 Why Move to Cloud
Resource contention among thousands of services on the same host drove the decision to adopt cloud techniques.
2.2 What Parts Are Cloud‑ified
Access layer: dedicated physical machines, ample capacity, few changes – not cloud‑ified.
Logic layer: mixed deployment, unpredictable capacity, frequent changes – cloud‑ified.
Storage layer: dedicated machines, controllable capacity, few changes – not cloud‑ified.
2.3 Cgroup‑Based Cloud Implementation
We use kernel cgroups to partition each physical machine into lightweight, VM-like slices of fixed size (e.g., 1 CPU + 1 GB, 2 CPU + 4 GB).
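As a rough illustration of what such a slice could translate to, the sketch below maps a slice size to cgroup v1 control-file values and counts how many slices fit on a host. The talk does not say which cgroup version or controllers are used, so the choice of `cpu.cfs_quota_us`, `cpu.cfs_period_us`, and `memory.limit_in_bytes`, as well as the `Slice` type and the helper names, are assumptions.

```python
from dataclasses import dataclass

# Assumed CFS period in microseconds; 100 ms is the common kernel default.
CFS_PERIOD_US = 100_000


@dataclass(frozen=True)
class Slice:
    """A hypothetical fixed-size resource slice, e.g. 2 CPU + 4 GB."""
    cpus: int
    memory_gb: int


def cgroup_v1_settings(s: Slice) -> dict[str, int]:
    """Map a slice to the cgroup v1 values that would cap its CPU and memory."""
    return {
        "cpu.cfs_period_us": CFS_PERIOD_US,
        "cpu.cfs_quota_us": s.cpus * CFS_PERIOD_US,
        "memory.limit_in_bytes": s.memory_gb * 1024**3,
    }


def slices_per_host(host_cpus: int, host_memory_gb: int, s: Slice) -> int:
    """How many slices of this size fit on one physical machine."""
    return min(host_cpus // s.cpus, host_memory_gb // s.memory_gb)
```

The sketch only computes values; actually writing them into the cgroup filesystem requires root privileges and is left out.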
2.4 Why Docker Is Not Used
Our svrkit framework covers the entire fleet and relies heavily on IPC, so Docker's intrusiveness is undesirable.
Docker’s process restart behavior could disrupt services.
We prefer a non‑invasive, self‑developed solution.
2.5 Private Cloud Scheduling System
We built a private cloud scheduler inspired by Borg, Yarn, Kubernetes, and Mesos, covering about 80 % of micro‑services.
2.6 Cloud Management Summary
Goal: service‑level resource isolation.
Goal: service scaling performed through web-page operations.
Measure: deployment system blocks non‑cloud services; core services are actively migrated.
3. Capacity Management
3.1 Supporting Business Growth
Capacity should match business growth curves; frequent scaling keeps capacity aligned with demand.
3.2 Evaluating Capacity with Hardware Metrics
CPU, memory, disk, and network usage are primary indicators, though they have limitations.
3.3 CPU‑Based Capacity Formula
Service capacity = current peak / empirical CPU ceiling
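One way to read the formula is as the fraction of usable capacity already consumed: the CPU utilization at the current peak divided by the empirically chosen CPU ceiling. That reading is an assumption, since the talk does not define the units, and the function name below is hypothetical.

```python
def capacity_ratio(peak_cpu_pct: float, cpu_ceiling_pct: float) -> float:
    """Fraction of usable capacity consumed at peak (assumed reading of the formula).

    peak_cpu_pct: CPU utilization observed at the current traffic peak.
    cpu_ceiling_pct: empirical CPU utilization the service should not exceed.
    """
    if cpu_ceiling_pct <= 0:
        raise ValueError("cpu_ceiling_pct must be positive")
    return peak_cpu_pct / cpu_ceiling_pct
```

Under this reading, a service peaking at 40 % CPU with an empirical ceiling of 80 % is using half of its usable capacity, and a ratio above 1.0 means the service is already past its ceiling.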
3.4 Limitations of Hardware Metrics
Different services are constrained by different resources, and performance near critical thresholds can be unpredictable.
3.5 Stress‑Testing Methods
Simulated traffic in test environment.
Simulated traffic in production (full‑link testing).
Real traffic in test environment (bypassing to test storage).
Real traffic in production.
3.6 Online Stress‑Testing
Adjust service weight dynamically, monitor queue latency, and throttle the test rate to obtain a precise performance model within seconds.
3.7 Self‑Protection and Upstream Retry
Services can quickly reject excess requests; upstream services have retry protection to route traffic to healthy instances.
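The interaction between fast rejection and bounded upstream retry can be sketched as follows. This is a toy model under stated assumptions: a full queue is the rejection signal, and the caller tries a limited number of instances in order. The `Instance` class, `send_with_retry`, and the `max_attempts` parameter are hypothetical names, not svrkit APIs.

```python
from collections import deque


class Instance:
    """A service instance that rejects work once its queue is full."""

    def __init__(self, name: str, max_queue: int):
        self.name = name
        self.max_queue = max_queue
        self.queue = deque()

    def try_accept(self, request) -> bool:
        """Fast-reject when the queue is full instead of letting latency grow."""
        if len(self.queue) >= self.max_queue:
            return False
        self.queue.append(request)
        return True


def send_with_retry(instances, request, max_attempts=2):
    """Try at most max_attempts instances; return the accepting name or None."""
    for instance in instances[:max_attempts]:
        if instance.try_accept(request):
            return instance.name
    return None
```

Capping the number of attempts keeps retries from multiplying load on a fleet that is already saturated.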
3.8 Multi‑Dimensional Monitoring
Monitoring includes hardware metrics, fast‑reject alerts, latency for front‑end and back‑end, and failure detection across the entire call chain.
3.9 Second-Granularity Monitoring
Metrics are collected every 6 seconds, enabling anomaly detection within ten seconds.
3.10 Dynamic Rate Control
Queue backlog drives the test rate: fast when idle, slowed down as backlog appears, yielding a stable capacity curve.
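The feedback loop described here can be simulated in a few lines. The sketch below is an assumption-laden toy model, not the production controller: the multiplicative step sizes, the per-round service rate, and the function names are all hypothetical.

```python
def next_weight(weight, backlog, step_up=1.2, step_down=0.9):
    """Raise the test weight while the queue is empty, lower it once backlog appears."""
    return weight * (step_up if backlog == 0 else step_down)


def run_pressure_test(service_rate, start_weight=10.0, rounds=50):
    """Simulate rounds of traffic against one instance.

    Each round the instance receives `weight` requests and can serve
    `service_rate` of them; unserved requests stay in the queue.
    Returns the weight applied in every round.
    """
    weight, backlog, history = start_weight, 0.0, []
    for _ in range(rounds):
        backlog = max(0.0, backlog + weight - service_rate)
        history.append(weight)
        weight = next_weight(weight, backlog)
    return history
```

In this model the applied weight ramps up quickly and then oscillates in a band around the instance's real service rate, which is the stable region a capacity estimate would be read from.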
3.11 Capacity Management Summary
Accurate quantification of service resource needs.
Identification of optimal machine types for each micro‑service.
4. Automatic Scheduling
4.1 Automatic Scaling for Business Growth
Using performance curves and traffic forecasts, services are kept at 50‑60 % utilization with a 66 % safety margin, allowing graceful scaling.
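A minimal sizing sketch, assuming the performance curve has been reduced to a single per-instance throughput figure and the forecast to a single peak value. The function and parameter names are hypothetical, and the 66 % figure is not modeled because the talk does not define how it is applied.

```python
import math


def instances_needed(forecast_peak_qps: float,
                     per_instance_qps: float,
                     target_utilization: float = 0.6) -> int:
    """Instances required so that the forecast peak lands at the target utilization."""
    if per_instance_qps <= 0:
        raise ValueError("per_instance_qps must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return math.ceil(forecast_peak_qps / (per_instance_qps * target_utilization))
```

Comparing this result with the number of running instances gives the scale-out or scale-in delta for the scheduler.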
4.2 Automatic Scaling for Anomalies
Sudden traffic spikes trigger CPU‑based auto‑scale.
Program performance regressions are detected via frequent stress‑tests.
4.3 Performance Management Loop
New versions are stress‑tested in gray release; regressions are blocked before full rollout.
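The gate in this loop can be expressed as a comparison between the capacity measured for the current version and for the gray-release candidate. This is a sketch only: the 5 % tolerance and the function name are assumptions, not values from the talk.

```python
def blocks_rollout(baseline_qps: float,
                   candidate_qps: float,
                   max_regression: float = 0.05) -> bool:
    """True when the candidate's measured capacity dropped too far below baseline."""
    if baseline_qps <= 0:
        raise ValueError("baseline_qps must be positive")
    drop = (baseline_qps - candidate_qps) / baseline_qps
    return drop > max_regression
```

A candidate that measures faster than the baseline yields a negative drop and passes the gate.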
4.4 Peak‑Shaving and Valley‑Filling
High‑peak services release resources after peak; low‑peak services acquire idle resources, smoothing overall load.
4.5 Offline Computing to Fill Valleys
Offline tasks fill idle capacity on a time-window basis: they run unrestricted from 01:00 to 08:00 and throttled from 08:00 to 20:00, always under strict Cgroup limits and at the lowest priority.
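The time windows can be encoded as a small lookup. In this sketch the policy labels and the function name are hypothetical, and the hours from 20:00 to 01:00 are reported as "unspecified" because the talk does not state a policy for them.

```python
def offline_policy(hour: int) -> str:
    """Return the offline-task policy for a given hour of day (0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if 1 <= hour < 8:
        return "unrestricted"   # 01:00-08:00 in the talk
    if 8 <= hour < 20:
        return "throttled"      # 08:00-20:00 in the talk
    return "unspecified"        # assumption: the talk gives no policy here
```

A scheduler could consult this before launching offline work and then apply the matching cgroup limits.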
4.6 Automatic Scheduling Summary
Full control of online services to maximize resource utilization.
Offline tasks share online CPU and memory without separate provisioning.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
