How WeChat Scales Massive Real-Time Monitoring: Design & Practices

This article details the architecture and practical techniques behind WeChat's large‑scale monitoring system, covering lightweight data collection, classification of real‑time, non‑real‑time and user‑specific metrics, anomaly detection algorithms, automated configuration, and high‑performance storage solutions for billions of events per minute.

Efficient Ops

Feb 5, 2018

Introduction

WeChat operates a massive backend system that handles hundreds of millions of calls per minute. These calls generate billions of monitoring data points, far too many to manage manually, so the service depends on a comprehensive, stable, and fast operations monitoring platform.

The monitoring system provides three core functions: fault alarm, fault analysis and localization, and automated strategies.

1. Lightweight Monitoring Data Collection

Typical data collection involves log extraction, local aggregation, and transmission to a central server. For WeChat, the reported scale is roughly 200 w calls per minute ("w" is Chinese shorthand for ten thousand) producing about 200 billion monitoring records per minute.

The early approach, custom text logs, put heavy load on CPU, network, storage, and downstream statistics processing. This prompted a redesign aimed at stable minute-level, and even second-level, monitoring.

The redesign rests on two ideas: data classification, and a custom processing strategy for each class.

Data is divided into three categories:

Real-time data for fault monitoring and analysis.

Non-real-time statistical data, such as business reports.

Single-user anomaly data for investigating individual incidents.

1.1 Non‑real‑time Data

Users report a logid plus custom fields. The data passes through shared-memory queues and is packed into batches before sending, which reduces disk I/O and the load on the logging servers. Statistics are then computed in a distributed fashion.
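The batching idea can be sketched as follows. This is a minimal in-process stand-in, not WeChat's implementation: a `collections.deque` plays the role of the shared-memory queue, and `LogBatcher`, `max_batch`, and the JSON packet format are hypothetical choices made for illustration.

```python
import json
from collections import deque


class LogBatcher:
    """Buffer (logid, fields) records and emit them as packed batches."""

    def __init__(self, max_batch=3):
        self.max_batch = max_batch
        self.queue = deque()  # stand-in for the shared-memory queue
        self.sent = []        # packets that would go to the logging server

    def report(self, logid, **fields):
        # Callers only enqueue; no disk or network I/O on the hot path.
        self.queue.append({"logid": logid, **fields})
        if len(self.queue) >= self.max_batch:
            self.flush()

    def flush(self):
        # Pack everything queued so far into a single packet.
        if not self.queue:
            return
        batch = list(self.queue)
        self.queue.clear()
        self.sent.append(json.dumps(batch).encode("utf-8"))


batcher = LogBatcher(max_batch=3)
for i in range(7):
    batcher.report(1001, seq=i)
batcher.flush()  # drain whatever is left
print([len(json.loads(p)) for p in batcher.sent])
```

Seven reports leave the process as three packets instead of seven writes, which is the effect the batching step is after.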

1.2 Single‑User Anomaly Analysis

Records follow a fixed format (logid plus server IP, return code, and so on). They are sampled, then stored in TDW (Tencent's distributed data warehouse) and in a cache that serves quick queries.
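One way to sample such records is sketched below. The source does not say how sampling is done, so everything here is an assumption: the `uin` user-id field, the `should_keep` helper, and the choice to hash the user id so that a sampled user's records are either all kept or all dropped.

```python
import zlib
from collections import namedtuple

# Fixed-format record; the field names are illustrative only.
Record = namedtuple("Record", "uin logid server_ip ret_code")


def should_keep(record, rate_percent):
    """Deterministically keep roughly rate_percent% of users.

    Hashing the user id (instead of drawing a random number per record)
    gives the same decision for every record of a given user.
    """
    bucket = zlib.crc32(str(record.uin).encode("utf-8")) % 100
    return bucket < rate_percent


records = [Record(uin, 2001, "10.0.0.1", 0) for uin in range(1000)]
kept = [r for r in records if should_keep(r, rate_percent=10)]
print(len(kept), "of", len(records), "records kept")
```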

1.3 Real‑time Monitoring Data

Real‑time data accounts for the majority of the 200 billion/min reports and includes backend monitoring, terminal monitoring, and external service monitoring.

1.3.1 Backend Data Monitoring

Four layers are monitored:

Hardware metrics (CPU, memory, I/O, network).

Process status (resource consumption).

Inter‑module call chains for fault localization.

Business indicators.

Data is reduced to a uniform format keyed by IP+Key (later ContainerID+IP+Key) and stored in shared memory as uint32_t[MAX_ID][MAX_KEY]. Three reporting modes are supported: increment, set a new value, and set the maximum value.
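The three reporting modes can be illustrated with a small sketch. A nested Python list stands in for the uint32_t[MAX_ID][MAX_KEY] shared-memory segment; the function names, the tiny table sizes, and the `collect_and_reset` step (a collector reading and zeroing the table each interval) are assumptions, not details from the source.

```python
MAX_ID = 4    # illustrative sizes; the real limits are not given
MAX_KEY = 8
UINT32_MAX = 0xFFFFFFFF

# Stand-in for the uint32_t[MAX_ID][MAX_KEY] shared-memory segment.
table = [[0] * MAX_KEY for _ in range(MAX_ID)]


def report_add(id_, key, delta=1):
    """Increment mode; wraps around like unsigned 32-bit arithmetic."""
    table[id_][key] = (table[id_][key] + delta) & UINT32_MAX


def report_set(id_, key, value):
    """Set mode: overwrite the cell with the new value."""
    table[id_][key] = value & UINT32_MAX


def report_max(id_, key, value):
    """Max mode: keep the largest value seen in this interval."""
    table[id_][key] = max(table[id_][key], value & UINT32_MAX)


def collect_and_reset():
    """Snapshot non-zero cells, then zero the table for the next interval."""
    snapshot = {(i, k): v
                for i, row in enumerate(table)
                for k, v in enumerate(row) if v}
    for row in table:
        for k in range(MAX_KEY):
            row[k] = 0
    return snapshot


report_add(1, 0)
report_add(1, 0)
report_set(1, 1, 42)
report_max(1, 2, 7)
report_max(1, 2, 3)   # smaller value, ignored
snapshot = collect_and_reset()
print(snapshot)
```

A report is a single array write with no formatting or I/O, which is what makes this format far cheaper than text logs.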

Call‑relationship data, the second‑largest data volume, is also stored in shared memory and forwarded for analysis.

1.3.2 Terminal Data Monitoring

Mobile client logs are sampled and batched to minimize impact on devices and backend, with version‑specific sampling strategies.

1.3.3 External Monitoring Service

A cloud‑monitoring‑style service allows merchants and mini‑program developers to configure dimensions and monitoring rules for their external services.

2. Evolution of WeChat Monitoring

2.1 Anomaly Detection

Traditional methods (fixed thresholds, day-over-day and week-over-week comparison) proved insufficient. Improved algorithms include:

Standard deviation (mean-square deviation) of the values observed at the same time of day over the past month.

Polynomial fitting for stable curves.

Both have limitations, prompting ongoing research into hybrid approaches.
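The deviation-based check can be sketched as below. It covers only the first method, and the details are assumptions: the source gives no formula, so the population standard deviation, the `k=3.0` multiplier, and the `is_anomalous` name are illustrative choices.

```python
from statistics import mean, pstdev


def is_anomalous(history, current, k=3.0):
    """Flag `current` if it is more than k standard deviations from the mean.

    `history` holds the values observed at the same minute of the day
    on previous days (for example, the past month).
    """
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        return current != mu
    return abs(current - mu) > k * sigma


history = [100, 104, 98, 101, 97, 103, 99, 102]
print(is_anomalous(history, 102))  # within the usual spread
print(is_anomalous(history, 140))  # far outside it
```

A check like this adapts to each curve's own variability, but it flags too much on noisy curves and too little on curves whose history already contains incidents, which is consistent with the limitations noted above.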

2.2 Monitoring Configuration

With over 300 k monitoring items, manual configuration is unsustainable. Automated configuration leverages historical data, anomaly samples, and feature extraction to suggest optimal parameters, though it remains a work in progress.

3. Data Storage Design for Massive Monitoring

Storage must support high write throughput (over 200 million records per minute) and fast reads (50 w, that is, about 500,000, per minute against 22 days of data). Multi-dimensional keys (main key + sub key) enable flexible queries across the client, server, module, and host dimensions.

A custom time‑series key‑value store keeps keys in memory and distributes data across a cluster using hash(main_key). Prefix matching queries use a modified binary search, achieving >1 million queries per second with caching.
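A rough sketch of the two lookups follows. The source does not describe how the binary search was modified, so this uses a plain binary search to find the first candidate and then scans forward. The `main_key|sub_key` string layout, `NUM_NODES`, and the use of CRC32 as the hash are all assumptions.

```python
import bisect
import zlib

NUM_NODES = 4  # hypothetical cluster size


def node_for(main_key):
    """Choose a storage node from a stable hash of the main key only,
    so every sub key of one main key lands on the same node."""
    return zlib.crc32(main_key.encode("utf-8")) % NUM_NODES


def prefix_query(sorted_keys, prefix):
    """Return all keys starting with `prefix` from a sorted key list."""
    # Binary search for the first key >= prefix.
    start = bisect.bisect_left(sorted_keys, prefix)
    # Matching keys are contiguous in sorted order, so scan until one fails.
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


keys = sorted([
    "login|client=ios",
    "login|host=10.0.0.1",
    "login|host=10.0.0.2",
    "pay|host=10.0.0.1",
])
print(node_for("login"))
print(prefix_query(keys, "login|host="))
```

Because the keys sit sorted in memory, a query such as "all hosts of the login module" costs one logarithmic search plus the size of the result, with no full scan.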

To reduce memory pressure, data is cached for one hour and merged into daily records. An improved design splits key‑id‑value, rotating the id mapping every seven days for better balance.
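The hourly cache and daily merge might look roughly like the sketch below. It is one reading of the sentence above, not the actual design: the `SeriesStore` class, the per-minute slot layout, and the explicit `merge_hour` call are hypothetical, and the key-id split and seven-day rotation are left out.

```python
MINUTES_PER_DAY = 1440


class SeriesStore:
    """Recent points live in a small cache; older ones in daily records."""

    def __init__(self):
        self.hour_cache = {}  # key -> {minute_of_day: value}
        self.daily = {}       # key -> list with one slot per minute

    def write(self, key, minute_of_day, value):
        self.hour_cache.setdefault(key, {})[minute_of_day] = value

    def merge_hour(self):
        """Fold cached points into the daily records, then empty the cache."""
        for key, points in self.hour_cache.items():
            day = self.daily.setdefault(key, [0] * MINUTES_PER_DAY)
            for minute, value in points.items():
                day[minute] = value
        self.hour_cache.clear()

    def read(self, key, minute_of_day):
        # Prefer the cache for recent data, fall back to the daily record.
        cached = self.hour_cache.get(key, {})
        if minute_of_day in cached:
            return cached[minute_of_day]
        day = self.daily.get(key)
        return day[minute_of_day] if day else 0


store = SeriesStore()
store.write("login|host=10.0.0.1", 61, 5)
before = store.read("login|host=10.0.0.1", 61)  # served from the cache
store.merge_hour()
after = store.read("login|host=10.0.0.1", 61)   # served from the daily record
print(before, after, len(store.hour_cache))
```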

High availability is ensured via PhxPaxos, WeChat's open-source Paxos library, which provides strong consistency and multi-master reads.

