
WeChat’s 900M MAU Scaling: Secrets of Efficient Operations

The talk outlines WeChat’s approach to handling rapid user growth through disciplined operational standards, cloud‑native management, precise capacity planning, and automated scaling, detailing configuration file conventions, name‑service design, hardware metric evaluation, stress‑testing methods, and dynamic resource allocation to maintain high efficiency and low cost.

Efficient Ops

Oct 10, 2017


1. Operational Standards

When business volume grows quickly, efficiency is the primary concern; later, cost becomes the focus. The operational standards are divided into four parts: operational norms, cloud‑based management, capacity management, and automatic scheduling.

Operational norms

Cloud‑based management

Capacity management

Automatic scheduling

1.1 Configuration File Standards

Configuration files are standardized in directory structure, cross‑service shared items, per‑instance differences, and environment differences (dev/test/production). The goal is that the MD5 of configuration files remains identical across all environments, enabling seamless deployment without manual scripts.

All instances of the same service version have identical configuration file MD5 across environments.
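The MD5 rule above lends itself to an automated check. The following is a minimal sketch, not WeChat's actual tooling: the environment names, the sample file contents, and the helper names `md5_of` and `inconsistent_environments` are all hypothetical.

```python
import hashlib


def md5_of(content: bytes) -> str:
    """Return the hex MD5 digest of a configuration file's bytes."""
    return hashlib.md5(content).hexdigest()


def inconsistent_environments(configs: dict[str, bytes]) -> list[str]:
    """Return the environments whose config digest differs from production.

    `configs` maps a hypothetical environment name to the file contents
    deployed there. The "production" key is assumed to exist.
    """
    reference = md5_of(configs["production"])
    return sorted(env for env, data in configs.items() if md5_of(data) != reference)


# Hypothetical sample: the same file deployed to three environments.
configs = {
    "dev": b"listen_port=8080\nworkers=4\n",
    "test": b"listen_port=8080\nworkers=4\n",
    "production": b"listen_port=8080\nworkers=4\n",
}
print(inconsistent_environments(configs))  # prints []
```

A check like this could run before each release and fail when any environment drifts from the reference digest.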

1.2 Name Service Standards

The name service is organized into three layers: access layer (LVS‑like), logic layer (etcd‑like), and storage layer (automated routing). Service scaling is treated as an operations task independent of development releases.

1.3 Data Storage Standards

Access layer: no data.

Logic layer: short‑lived cache, static data only, no dynamic data persistence.

Storage layer: long‑lived cache, Paxos‑based data persistence.

Scaling the access and logic layers does not require data migration or cache‑hit considerations.

2. Cloud‑Based Management

We migrated the logic layer to a private cloud because of the massive number of micro‑services (≈5,000) and resource contention on physical machines.

2.1 Why Move to Cloud

Resource contention among thousands of services on the same host drove the decision to adopt cloud techniques.

2.2 What Parts Are Cloud‑ified

Access layer: dedicated physical machines, ample capacity, few changes – not cloud‑ified.

Logic layer: mixed deployment, unpredictable capacity, frequent changes – cloud‑ified.

Storage layer: dedicated machines, controllable capacity, few changes – not cloud‑ified.

2.3 Cgroup‑Based Cloud Implementation

We use kernel Cgroups to partition each physical machine into lightweight, VM-like slices (e.g., 1 CPU + 1 GB, 2 CPU + 4 GB).
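To make the slice sizes concrete, here is a sketch of the limits such a slice might translate to. It assumes the cgroup v1 `cpu` and `memory` controllers and their standard file names; the article does not say which controllers or values WeChat uses, and `slice_settings` is a hypothetical helper that only computes values without writing to the cgroup filesystem.

```python
CFS_PERIOD_US = 100_000  # default CFS scheduling period in cgroup v1 (100 ms)


def slice_settings(cpus: int, mem_gb: int) -> dict[str, int]:
    """Compute cgroup v1 limits for a slice of `cpus` cores and `mem_gb` GiB.

    ASSUMPTION: CPU is capped with CFS bandwidth control and memory with a
    hard limit. The source only names the slice sizes, not the mechanism.
    """
    return {
        "cpu.cfs_period_us": CFS_PERIOD_US,
        "cpu.cfs_quota_us": cpus * CFS_PERIOD_US,  # quota/period = cores allowed
        "memory.limit_in_bytes": mem_gb * 1024 ** 3,
    }


for cpus, mem_gb in [(1, 1), (2, 4)]:
    print(cpus, mem_gb, slice_settings(cpus, mem_gb))
```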

2.4 Why Docker Is Not Used

Our svrkit framework covers the entire fleet and relies heavily on IPC, so an intrusive change such as adopting Docker is undesirable.

Docker’s process restart behavior could disrupt services.

We prefer a non‑invasive, self‑developed solution.

2.5 Private Cloud Scheduling System

We built a private cloud scheduler inspired by Borg, YARN, Kubernetes, and Mesos; it covers about 80% of micro-services.

2.6 Cloud Management Summary

Goal: service‑level resource isolation.

Goal: service scaling operated from a web page rather than by hand.

Measure: deployment system blocks non‑cloud services; core services are actively migrated.

3. Capacity Management

3.1 Supporting Business Growth

Capacity should match business growth curves; frequent scaling keeps capacity aligned with demand.

3.2 Evaluating Capacity with Hardware Metrics

CPU, memory, disk, and network usage are primary indicators, though they have limitations.

3.3 CPU‑Based Capacity Formula

Service capacity used = current peak CPU usage / empirical CPU ceiling
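A small worked example may help. It reads the formula as the ratio of peak CPU usage to the empirical ceiling, which is an interpretation: the source gives only the one-line formula. The numbers and the function name `capacity_used` are illustrative.

```python
def capacity_used(peak_cpu_pct: float, cpu_ceiling_pct: float) -> float:
    """Fraction of a service's capacity consumed at peak.

    `cpu_ceiling_pct` is the empirically chosen CPU level beyond which the
    service is considered full (a hypothetical 80% in the example below).
    """
    if cpu_ceiling_pct <= 0:
        raise ValueError("cpu_ceiling_pct must be positive")
    return peak_cpu_pct / cpu_ceiling_pct


# A service peaking at 40% CPU against an 80% ceiling is half full.
print(capacity_used(40.0, 80.0))  # prints 0.5
```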

3.4 Limitations of Hardware Metrics

Different services are constrained by different resources, and performance near critical thresholds can be unpredictable.

3.5 Stress‑Testing Methods

Simulated traffic in test environment.

Simulated traffic in production (full‑link testing).

Real traffic in test environment (bypassing to test storage).

Real traffic in production.

3.6 Online Stress‑Testing

Adjust service weight dynamically, monitor queue latency, and throttle the test rate to obtain a precise performance model within seconds.

3.7 Self‑Protection and Upstream Retry

Services can quickly reject excess requests; upstream services have retry protection to route traffic to healthy instances.
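The interaction between fast rejection and upstream retry can be sketched as below. This is a simplified model under stated assumptions, not the svrkit implementation: the `Instance` class, the bounded queue, and the `max_attempts` cap are all hypothetical.

```python
from collections import deque
from typing import Optional


class Instance:
    """A service instance with a bounded request queue (hypothetical model)."""

    def __init__(self, name: str, max_queue: int) -> None:
        self.name = name
        self.max_queue = max_queue
        self.queue = deque()

    def accept(self, request: str) -> bool:
        """Enqueue the request, or reject immediately when the queue is full."""
        if len(self.queue) >= self.max_queue:
            return False  # fast reject: fail now instead of queueing past capacity
        self.queue.append(request)
        return True


def send_with_retry(request: str, instances: list[Instance],
                    max_attempts: int = 2) -> Optional[str]:
    """Try instances in order and return the name of the one that accepted.

    ASSUMPTION: retries are capped at `max_attempts` so that an overloaded
    cluster is not hit by a retry storm.
    """
    for instance in instances[:max_attempts]:
        if instance.accept(request):
            return instance.name
    return None


busy = Instance("a", max_queue=1)
busy.accept("r0")  # instance "a" is now full
healthy = Instance("b", max_queue=1)
print(send_with_retry("r1", [busy, healthy]))  # prints b
```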

3.8 Multi‑Dimensional Monitoring

Monitoring includes hardware metrics, fast‑reject alerts, latency for front‑end and back‑end, and failure detection across the entire call chain.

3.9 Second‑Level Monitoring

Metrics are collected every 6 seconds, enabling anomaly detection within ten seconds.

3.10 Dynamic Rate Control

Queue backlog drives the test rate: fast when idle, slowed down as backlog appears, yielding a stable capacity curve.
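The feedback loop can be illustrated with a toy simulation. The step size, the quarter cut on backlog, and the simulated service that drains a fixed number of requests per round are assumptions chosen for illustration; the source does not give the actual control law.

```python
def next_rate(rate: int, backlog: int, step: int = 50) -> int:
    """Raise the test rate while the queue is empty; cut it by a quarter on backlog."""
    if backlog > 0:
        return max(1, rate - rate // 4)
    return rate + step


def probe_capacity(capacity: int, rounds: int = 40, start_rate: int = 100) -> list[int]:
    """Drive a simulated service that drains `capacity` requests per round.

    Returns the test rate used in each round.
    """
    rate, backlog, history = start_rate, 0, []
    for _ in range(rounds):
        backlog = max(0, backlog + rate - capacity)  # requests left in the queue
        history.append(rate)
        rate = next_rate(rate, backlog)
    return history


history = probe_capacity(capacity=1000)
print(history[:5])    # early rounds ramp up quickly
print(history[-10:])  # later rounds hover around the simulated capacity
```

In this model the rate climbs while the queue stays empty and backs off as soon as a backlog forms, so it settles into a narrow band around the capacity being measured.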

3.11 Capacity Management Summary

Accurate quantification of service resource needs.

Identification of optimal machine types for each micro‑service.

4. Automatic Scheduling

4.1 Automatic Scaling for Business Growth

Using performance curves and traffic forecasts, services are kept at 50‑60 % utilization with a 66 % safety margin, allowing graceful scaling.
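As a rough illustration of how a forecast turns into an instance count, the sketch below divides the forecast peak by the usable throughput per instance. The function name, the QPS figures, and the 60% default (taken from the upper end of the utilization range above) are assumptions; the real scheduler presumably uses the full performance curve rather than a single number.

```python
def instances_needed(forecast_peak_qps: int, per_instance_qps: int,
                     target_util_pct: int = 60) -> int:
    """Instances required so that peak load sits at the target utilization.

    `per_instance_qps` is the maximum throughput of one instance, as measured
    by stress testing. Integer arithmetic avoids floating-point rounding.
    """
    usable = per_instance_qps * target_util_pct  # usable QPS per instance, times 100
    return -(-forecast_peak_qps * 100 // usable)  # ceiling division on integers


# Hypothetical numbers: 120,000 QPS forecast, 1,000 QPS per instance.
print(instances_needed(120_000, 1_000))  # prints 200
```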

4.2 Automatic Scaling for Anomalies

Sudden traffic spikes trigger CPU‑based auto‑scale.

Program performance regressions are detected via frequent stress‑tests.

4.3 Performance Management Loop

New versions are stress‑tested in gray release; regressions are blocked before full rollout.
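A gate of this kind could be as simple as comparing stress-test results for the new and current versions. The sketch below is hypothetical: the 5% tolerance and the function name `allow_full_rollout` are not from the source.

```python
def allow_full_rollout(baseline_qps: int, candidate_qps: int,
                       max_drop_pct: int = 5) -> bool:
    """Return True if the candidate's measured capacity is within tolerance.

    Both figures are the per-instance throughput found by stress testing.
    ASSUMPTION: a drop of more than `max_drop_pct` percent counts as a
    regression and blocks the rollout.
    """
    return candidate_qps * 100 >= baseline_qps * (100 - max_drop_pct)


print(allow_full_rollout(1000, 960))  # prints True
print(allow_full_rollout(1000, 900))  # prints False
```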

4.4 Peak‑Shaving and Valley‑Filling

High‑peak services release resources after peak; low‑peak services acquire idle resources, smoothing overall load.

4.5 Offline Computing to Fill Valleys

Offline tasks run unrestricted in the low-traffic window (01:00-08:00) and throttled from 08:00 to 20:00, always under strict Cgroup limits and at the lowest priority.
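The time windows can be expressed as a small policy function. The two windows come from the text; the behavior between 20:00 and 01:00 is not stated in the source, so the "paused" result for that period is an assumption, as is the function name `offline_mode`.

```python
from datetime import time


def offline_mode(now: time) -> str:
    """Return how offline tasks may run at the given local time of day."""
    if time(1, 0) <= now < time(8, 0):
        return "unrestricted"  # low-traffic window from the source
    if time(8, 0) <= now < time(20, 0):
        return "throttled"     # daytime window from the source
    return "paused"            # ASSUMPTION: the source does not cover 20:00-01:00


print(offline_mode(time(3, 0)))    # prints unrestricted
print(offline_mode(time(12, 30)))  # prints throttled
```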

4.6 Automatic Scheduling Summary

Full control of online services to maximize resource utilization.

Offline tasks share online CPU and memory without separate provisioning.
