WeChat Moments' Billion-Visit Architecture: Disaster Recovery & Flexible Scaling
The article analyzes WeChat Moments' massive image and video services, detailing its OC/IDC architecture, holiday traffic challenges, software and hardware safeguards, disaster‑recovery mechanisms, retry policies, and a series of flexible strategies—including compression format changes, bitrate reduction, buffer pools, and timeline throttling—to sustain billions of daily accesses.
1. Introduction
WeChat Moments consists of two business lines: image and video. Image traffic generates a huge number of requests and consumes significant compute resources, while video traffic mainly consumes bandwidth. All data is stored permanently, so rapid business growth continuously increases the consumption of storage capacity, bandwidth, and machines, especially during major holidays.
2. Holiday Technical Guarantees
Technical guarantees during peak holiday periods focus on three aspects:
Software guarantee – optimizing programs and business‑logic to reduce load.
Hardware guarantee – evaluating and scaling bandwidth and machine capacity.
Flexible measures – adjusting less critical features to protect key functionalities.
3. Software Architecture Guarantee
[Figure: overall architecture]
The system is divided into OC points (independent machine rooms on the external network) and IDCs (data centers where data is ultimately stored). Each IDC contains a full set of interface machines, logic-layer machines, and storage devices to support user upload/download and file persistence. OC points provide external access and together form a cache pool; on a local OC cache miss, the request is routed back to the IDC for retrieval. All OC points have identical functionality, and users are directed to the nearest one. If an OC point fails, the download is retried or switched to another OC point so that it still succeeds.
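The cache-miss path described above can be sketched roughly as follows. This is a minimal illustration, not the production design: the `IDC` and `OCPoint` classes, the in-memory dictionaries, and the file IDs are all assumptions, and cache eviction is left out.

```python
from typing import Dict, Optional


class IDC:
    """Stand-in for a data center that holds the persistent copy of every file."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files

    def fetch(self, file_id: str) -> Optional[bytes]:
        return self._files.get(file_id)


class OCPoint:
    """Stand-in for one OC point: serves from its local cache, else asks the IDC."""

    def __init__(self, name: str, idc: IDC) -> None:
        self.name = name
        self._idc = idc
        self._cache: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def download(self, file_id: str) -> Optional[bytes]:
        data = self._cache.get(file_id)
        if data is not None:
            self.hits += 1
            return data
        # Local cache miss: go back to the IDC and keep a copy for later requests.
        self.misses += 1
        data = self._idc.fetch(file_id)
        if data is not None:
            self._cache[file_id] = data
        return data


idc = IDC({"img-1": b"jpeg-bytes"})
oc = OCPoint("oc-south", idc)
first = oc.download("img-1")   # miss, fetched from the IDC
second = oc.download("img-1")  # hit, served from the OC cache
```

Because every OC point behaves the same way, a client that fails against one point can repeat the same `download` call against another.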
4. Disaster Recovery and Retry Mechanism
Disaster recovery aims to automatically exclude faulty machines. The master server maintains an IP list; heartbeat detection identifies abnormal devices and masks their IPs from the frontend.
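A minimal sketch of heartbeat-based exclusion is shown here. The `Master` class, the timeout value, and the IP addresses are hypothetical; the article does not give the real heartbeat interval or data structures.

```python
from typing import Dict, List


class Master:
    """Hypothetical master that tracks the last heartbeat time of each machine."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s  # assumed threshold, not taken from the article
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, ip: str, now: float) -> None:
        self._last_seen[ip] = now

    def healthy_ips(self, now: float) -> List[str]:
        # Only IPs with a recent heartbeat are exposed to the frontend.
        return sorted(
            ip for ip, seen in self._last_seen.items()
            if now - seen <= self.timeout_s
        )


master = Master(timeout_s=3.0)
master.heartbeat("10.0.0.1", now=0.0)
master.heartbeat("10.0.0.2", now=0.0)
master.heartbeat("10.0.0.1", now=4.0)  # 10.0.0.2 has gone silent
visible = master.healthy_ips(now=5.0)
```

In this sketch a masked machine reappears in the list as soon as it sends a fresh heartbeat, so no manual step is needed for single-machine recovery.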
[Figure: example of front-layer single-machine exclusion]
If an entire OC or IDC fails, recovery typically relies on a manual switchover by operators or on retry mechanisms between modules.
Retry policy: by default, each request is retried up to two times after failure, and retries are directed to geographically distinct nodes. The master returns at least two groups of IPs, ensuring cross‑region retry capability. Because retries increase request volume, they can be disabled during peak periods via master routing, or manually turned off by on‑call staff when IDC failure rates exceed 20%.
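The retry policy can be sketched as below, assuming the master hands the client a list of IP groups with one group per region. The function name, the `fetch` callback, and the rotation rule are illustrative assumptions; only the limit of two retries and the on/off switch come from the text.

```python
from typing import Callable, List, Optional, Sequence

MAX_RETRIES = 2  # the article's default: up to two retries after the first failure


def download_with_retry(
    ip_groups: Sequence[Sequence[str]],
    fetch: Callable[[str], Optional[bytes]],
    retries_enabled: bool = True,
) -> Optional[bytes]:
    """Try one group first, then retry against the other groups in turn."""
    attempts = 1 + (MAX_RETRIES if retries_enabled else 0)
    for attempt in range(attempts):
        # Rotate through the groups so consecutive attempts hit different groups.
        group = ip_groups[attempt % len(ip_groups)]
        # When the rotation wraps around, move on to the next IP in that group.
        ip = group[(attempt // len(ip_groups)) % len(group)]
        data = fetch(ip)
        if data is not None:
            return data
    return None


calls: List[str] = []


def flaky_fetch(ip: str) -> Optional[bytes]:
    calls.append(ip)
    # Pretend the whole first region is down.
    return None if ip.startswith("10.1.") else b"ok"


groups = [["10.1.0.1", "10.1.0.2"], ["10.2.0.1", "10.2.0.2"]]
result = download_with_retry(groups, flaky_fetch)
tried_with_retry = list(calls)
calls.clear()
result_no_retry = download_with_retry(groups, flaky_fetch, retries_enabled=False)
```

With retries disabled, a failing region costs exactly one request per download, which is the trade-off the peak-period switch makes: fewer successful downloads in exchange for no amplification of load.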
5. Hardware Guarantee
5.1 Capacity Evaluation and Expansion
Before major holidays, operations teams assess each resource group and expand capacity based on the business budget, growth forecasts, and actual load. Traffic beyond the budgeted capacity is handled by flexible or overload-protection strategies.
Evaluation methods:
Data‑center capacity is evaluated based on the upper limit of switch bandwidth.
Access-layer capacity is evaluated using CPU and memory load ratios and network-card traffic/packet ratios.
Storage-layer capacity is evaluated using CPU and memory load ratios and disk I/O read/write counts.
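One way to turn these ratios into a capacity estimate is to treat the most heavily loaded metric as the bottleneck. The sketch below makes that assumption explicitly; the function name and all utilization figures are hypothetical, and the article does not say how the metrics are actually combined.

```python
from typing import Dict, Tuple


def headroom(utilization: Dict[str, float]) -> Tuple[str, float]:
    """Return the bottleneck metric and the load multiple it can absorb."""
    if not utilization or min(utilization.values()) <= 0:
        raise ValueError("utilization ratios must be positive")
    metric = max(utilization, key=lambda name: utilization[name])
    return metric, 1.0 / utilization[metric]


# Hypothetical peak-load ratios (fraction of the limit in use), not real data.
access_layer = {"cpu": 0.50, "memory": 0.40, "nic_traffic": 0.25, "nic_packets": 0.20}
storage_layer = {"cpu": 0.30, "memory": 0.35, "disk_io": 0.80}

access_bottleneck, access_multiple = headroom(access_layer)
storage_bottleneck, storage_multiple = headroom(storage_layer)
```

Under these made-up numbers the access layer could absorb about twice its current load before CPU saturates, while the storage layer is limited by disk I/O well before that.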
5.2 Spring Festival Upload Load
The business requires a 9× increase in upload capacity and a 1× increase in download capacity during the Spring Festival. Even after the budgeted expansion, most modules cannot support this growth. The compress module is the worst case: each additional multiple of load would require a large expansion of virtual machines.
6. Flexible Strategy Overview
WeChat Moments employs a two‑layer flexible strategy:
Coarse-grained flexibility: Limit upload/download requests directly by proportion; requests over the limit fail immediately. This is similar to WeChat C2C and is used for rapid recovery when the system is overloaded.
Fine‑grained flexibility: Reduce image/video quality, delay user updates, and other business‑level adjustments to lower system load.
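The coarse-grained layer amounts to admitting a fixed proportion of requests and failing the rest. A minimal deterministic sketch follows; the class name and the counting scheme are assumptions, and a real implementation might sample randomly instead.

```python
class ProportionalLimiter:
    """Admit roughly `pass_ratio` of requests and fail the rest immediately."""

    def __init__(self, pass_ratio: float) -> None:
        if not 0.0 <= pass_ratio <= 1.0:
            raise ValueError("pass_ratio must be between 0 and 1")
        self.pass_ratio = pass_ratio
        self._seen = 0
        self._admitted = 0

    def allow(self) -> bool:
        self._seen += 1
        # Admit only while the admitted share stays within the configured ratio.
        if self._admitted + 1 <= self.pass_ratio * self._seen:
            self._admitted += 1
            return True
        return False


limiter = ProportionalLimiter(pass_ratio=0.25)
decisions = [limiter.allow() for _ in range(100)]
admitted = sum(decisions)
```

Rejected requests do no downstream work at all, which is why this layer suits rapid recovery; the fine-grained measures keep requests succeeding at lower cost instead.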
7. Flexible Practice – Compression Module
The compress module transforms uploaded raw images into various formats and sizes. Switching from HEVC back to JPEG reduces CPU load by about 80% (from 100% to 20%), which supports 5× growth but increases average image size and therefore download bandwidth. To offset this, image quality is reduced from 70 to 50. The impact on perceived quality is minimal because the setting is active only for the short holiday period.
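The 5× figure follows directly from the CPU numbers quoted above, assuming CPU cost scales linearly with request volume (an assumption of this sketch, not a claim from the article):

```python
cpu_hevc_pct = 100  # compress-module CPU utilization with HEVC, from the article
cpu_jpeg_pct = 20   # utilization after switching back to JPEG, from the article

# Relative CPU saving and the request growth the freed CPU could absorb,
# assuming CPU cost is proportional to the number of images compressed.
load_reduction = (cpu_hevc_pct - cpu_jpeg_pct) / cpu_hevc_pct
growth_supported = cpu_hevc_pct / cpu_jpeg_pct
```

Here `load_reduction` is 0.8 and `growth_supported` is 5.0, matching the 80% and 5× figures in the text.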
8. Flexible Practice – Video Bitrate Reduction
Typical video bandwidth exceeds 1 TB. During holidays, the bitrate is reduced from 1800 kbps to 1200 kbps, shrinking the average file size from 2.1 MB to 1.3 MB. Tests show negligible impact on user experience. The effect on download traffic only materializes after about four hours, so the change must be applied ahead of the holiday peak.
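As a quick check of the figures above, the relative savings can be computed directly. The variable names are illustrative; the four input values are the ones quoted in the text.

```python
bitrate_before_kbps = 1800
bitrate_after_kbps = 1200
size_before_mb = 2.1
size_after_mb = 1.3

# Fractional reduction in bitrate and in average file size.
bitrate_cut = 1 - bitrate_after_kbps / bitrate_before_kbps
size_cut = 1 - size_after_mb / size_before_mb

print(f"bitrate reduced by {bitrate_cut:.0%}, average file size by {size_cut:.0%}")
```

The bitrate drops by about a third and the average file size by a little more than that, so download traffic for newly uploaded videos should fall by a similar proportion once those videos make up most of what is being served.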
9. Flexible Practice – Upload TSSD Buffer Pools
Two TSSD buffer pools are added to absorb overload:
Buffer Pool 1 (zone module): When the zone module is overloaded, excess upload requests are written to this pool. Files in this pool cannot be downloaded directly; they are slowly flushed to downstream modules, reducing short‑term upload spikes.
Buffer Pool 2 (pre‑upload module): The pre‑upload module limits write requests to the storage TFS. If the request rate exceeds TFS capacity, excess requests are stored in this pool. During download, the system checks whether a file resides in the pool and fetches it accordingly. When the pool is retired, its files must be manually migrated back to TFS.
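Buffer Pool 2 can be sketched as a write limiter with a spill-over store. Everything named here is hypothetical: the `PreUpload` class, the per-tick write limit, and the in-memory dictionaries standing in for TFS and the TSSD pool. The lookup order on download is also an assumption.

```python
from typing import Dict, Optional


class PreUpload:
    """Hypothetical pre-upload module: caps TFS writes, spills the excess to a pool."""

    def __init__(self, tfs_writes_per_tick: int) -> None:
        self.tfs_writes_per_tick = tfs_writes_per_tick  # assumed capacity unit
        self.tfs: Dict[str, bytes] = {}
        self.pool: Dict[str, bytes] = {}
        self._writes_this_tick = 0

    def next_tick(self) -> None:
        self._writes_this_tick = 0

    def upload(self, file_id: str, data: bytes) -> str:
        if self._writes_this_tick < self.tfs_writes_per_tick:
            self._writes_this_tick += 1
            self.tfs[file_id] = data
            return "tfs"
        # TFS write budget exhausted for this tick: park the file in the pool.
        self.pool[file_id] = data
        return "pool"

    def download(self, file_id: str) -> Optional[bytes]:
        # Check the buffer pool first, then fall back to TFS.
        if file_id in self.pool:
            return self.pool[file_id]
        return self.tfs.get(file_id)

    def retire_pool(self) -> int:
        """Move every pooled file into TFS; returns the number of files migrated."""
        moved = len(self.pool)
        self.tfs.update(self.pool)
        self.pool.clear()
        return moved


module = PreUpload(tfs_writes_per_tick=2)
placements = [module.upload(f"f{i}", b"data") for i in range(3)]
still_readable = module.download("f2")
migrated = module.retire_pool()
```

Unlike Buffer Pool 1, files parked here remain downloadable, which is why the download path has to consult the pool and why retiring the pool requires a migration step.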
10. Flexible Practice – Timeline Proportion Flex
The timeline (the update timestamp of a user's Moments feed) can be cached rather than pushed to users. Clients then do not see, and so do not download, new images and videos, which reduces bandwidth.
Potential issues: users may complain about missing content; overly long cache durations can cause delayed updates and subsequent traffic spikes.
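A rough sketch of proportional timeline caching is given below. It assumes the flex applies to a configurable percentage of users chosen by hashing the user ID and that cached timelines expire after a maximum age; the class, the parameters, and the hashing rule are all assumptions, since the article does not describe the mechanism in detail.

```python
import zlib
from typing import Dict, List, Tuple


class TimelineFlex:
    """Serve a cached timeline to a share of users for a bounded time."""

    def __init__(self, flex_percent: int, max_age_s: float) -> None:
        self.flex_percent = flex_percent  # share of users affected, 0 to 100
        self.max_age_s = max_age_s        # upper bound on how stale a view may be
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def _is_flexed(self, user_id: str) -> bool:
        return zlib.crc32(user_id.encode("utf-8")) % 100 < self.flex_percent

    def timeline(self, user_id: str, fresh: List[str], now: float) -> List[str]:
        cached = self._cache.get(user_id)
        if (
            cached is not None
            and self._is_flexed(user_id)
            and now - cached[0] < self.max_age_s
        ):
            return list(cached[1])  # stale view: new posts are not shown yet
        self._cache[user_id] = (now, list(fresh))
        return list(fresh)


flex = TimelineFlex(flex_percent=100, max_age_s=60.0)
t0 = flex.timeline("user-a", ["post-1"], now=0.0)
t1 = flex.timeline("user-a", ["post-1", "post-2"], now=30.0)   # served from cache
t2 = flex.timeline("user-a", ["post-1", "post-2"], now=61.0)   # cache expired
```

Keeping `max_age_s` short limits both of the issues listed above: users wait less for missing content, and the backlog released when the cache expires stays small.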
11. Spring Festival Manual Flex Steps
These steps outline the operational procedures for activating the above flexible measures during the Spring Festival.
IT Architects Alliance
