WhatsApp’s High‑Reliability Architecture for 450 Million Users
This article examines WhatsApp’s high‑reliability architecture that supports 450 million users, detailing its Erlang‑based backend, hardware choices, scaling techniques, performance metrics, monitoring tools, and lessons learned from achieving up to two million concurrent connections on a single server.
Information Sources
The complete WhatsApp architecture is not publicly disclosed; the following information is compiled from talks, interviews, and articles that describe fragments of the system, especially the use of Erlang to achieve millions of concurrent connections on a single server.
1. Statistics
450 million active users, a scale reached faster than any company before it.
32 engineers, each supporting ~14 million active users.
50 billion messages per day across seven platforms.
Zero advertising spend, $8 million investment, hundreds of nodes and thousands of cores, hundreds of TB of RAM.
More than 70 million Erlang messages per second.
In 2011 a single server held 1 million established TCP sessions; by 2012 that figure had reached 2 million. On December 31, 2013, WhatsApp reported a record 18 billion messages processed in a single day.
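As a quick sanity check on these figures, the following Python sketch reproduces the per-engineer ratio and converts the Erlang message rate into a daily total. Erlang messages are the internal messages passed between processes in the VM, so this total should not be read as a count of user chat messages. The constant names are illustrative only.

```python
# Figures quoted in the statistics above.
ACTIVE_USERS = 450_000_000
ENGINEERS = 32
ERLANG_MSGS_PER_SECOND = 70_000_000  # stated as a lower bound ("more than")

SECONDS_PER_DAY = 24 * 60 * 60

# Integer division is exact here: 32 * 14,062,500 = 450,000,000.
users_per_engineer = ACTIVE_USERS // ENGINEERS
erlang_msgs_per_day = ERLANG_MSGS_PER_SECOND * SECONDS_PER_DAY

print(f"{users_per_engineer:,} users per engineer")
print(f"{erlang_msgs_per_day:,} Erlang messages per day at minimum")
```

The first result matches the "~14 million" quoted above; the second shows that internal message traffic runs into the trillions per day.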
2. Platform
Backend
Erlang
FreeBSD
Yaws, lighttpd
PHP
Custom BEAM patches (BEAM is the Erlang VM)
Custom XMPP implementation
Frontend
Seven client platforms: iPhone, Android, BlackBerry, Nokia Symbian S60, Nokia S40, Windows Phone, and one that the sources do not name
SQLite for local storage
3. Hardware
Dual Westmere Hex‑core servers (24 logical CPUs)
100 GB RAM, SSD storage
Dual NICs for public and private networks
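The hardware figures can be checked with simple arithmetic. The sketch below assumes two hardware threads per core (Hyper-Threading), which is what makes two hex-core sockets add up to 24 logical CPUs, and it treats 100 GB as 100 × 10^9 bytes. Dividing RAM by the two-million-connection figure gives only an upper bound on the memory budget per connection, since the operating system and the VM need memory too.

```python
SOCKETS = 2
CORES_PER_SOCKET = 6     # "hex-core"
THREADS_PER_CORE = 2     # assumption: Hyper-Threading enabled

logical_cpus = SOCKETS * CORES_PER_SOCKET * THREADS_PER_CORE

RAM_BYTES = 100 * 10**9  # assumption: decimal gigabytes
CONNECTIONS = 2_000_000  # single-server figure quoted in this article

# Upper bound: assumes every byte of RAM is available to connections.
max_bytes_per_connection = RAM_BYTES // CONNECTIONS

print(logical_cpus)
print(max_bytes_per_connection)
```

Under these assumptions each connection can cost at most about 50 KB of memory, which suggests why per-connection overhead had to be tuned so carefully.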
4. Product Focus
Messaging that connects people wherever they are in the world, at little cost to them.
Privacy: messages are not stored on servers; chat history lives only on the device.
5. General Observations
WhatsApp’s server side is almost entirely implemented in Erlang.
Early servers were based on ejabberd (an open‑source XMPP server written in Erlang) and were heavily customized.
Scaling to 50 billion daily messages required a focus on reliability rather than monetization.
System health is monitored via queue lengths; alerts trigger when thresholds are crossed.
Multimedia messages are uploaded to an HTTP server and referenced by URL and Base64 thumbnail.
Erlang’s hot‑code loading enables rapid feature rollout without restarts.
Clients connect over SSL sockets; messages for an offline client are queued on the server until it reconnects and retrieves them.
Registration relies on phone numbers and a PIN‑based verification flow.
Google Push Service is used on Android.
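The store-and-forward behaviour described above (messages wait on the server until the recipient reconnects, and media travels as a URL plus a Base64 thumbnail) can be illustrated with a small in-memory sketch. It is written in Python rather than Erlang, and every name in it (`OfflineStore`, `media_message`, the field names, the example URL and phone number) is hypothetical; WhatsApp's actual data structures and wire format are not public.

```python
import base64
from collections import defaultdict, deque


def media_message(url: str, thumbnail_bytes: bytes) -> dict:
    """Build a message that references uploaded media instead of embedding it."""
    return {
        "type": "media",
        "url": url,
        "thumbnail_b64": base64.b64encode(thumbnail_bytes).decode("ascii"),
    }


class OfflineStore:
    """Holds undelivered messages per recipient, in arrival order."""

    def __init__(self) -> None:
        self._queues = defaultdict(deque)

    def enqueue(self, recipient: str, message: dict) -> None:
        self._queues[recipient].append(message)

    def pending(self, recipient: str) -> int:
        queue = self._queues.get(recipient)
        return len(queue) if queue else 0

    def drain(self, recipient: str) -> list:
        """Return queued messages on reconnect and drop them from the store."""
        return list(self._queues.pop(recipient, ()))


store = OfflineStore()
store.enqueue("+15550100", {"type": "text", "body": "hello"})
store.enqueue("+15550100",
              media_message("https://media.example.invalid/abc", b"\x89PNG"))

# The recipient reconnects: everything queued is handed over and removed.
delivered = store.drain("+15550100")
print(len(delivered), store.pending("+15550100"))
```

Dropping the queue inside `drain` mirrors the privacy stance stated earlier: once a message is delivered, it no longer lives on the server.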
6. Scaling a Single Server to 2 Million Connections
Initial load: 200,000 concurrent connections per server.
Planned capacity expansions to handle traffic spikes (e.g., football matches, earthquakes).
Goal: reach 1 million connections per server, later 2 million, with dynamic capacity planning.
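To see why raising the per-server connection count matters, the sketch below estimates how many servers a given peak load needs at each of the densities mentioned above. The peak of 100 million concurrent connections and the 25% headroom for spikes are made-up inputs for illustration, not WhatsApp figures, and `servers_needed` is a hypothetical helper.

```python
def servers_needed(peak_connections: int, per_server: int, headroom_pct: int) -> int:
    """Servers required to carry the peak plus a safety margin, rounded up."""
    required = peak_connections * (100 + headroom_pct)
    capacity = per_server * 100
    return -(-required // capacity)  # integer ceiling division


PEAK = 100_000_000  # hypothetical peak concurrent connections
HEADROOM_PCT = 25   # hypothetical margin for traffic spikes

for per_server in (200_000, 1_000_000, 2_000_000):
    print(per_server, servers_needed(PEAK, per_server, HEADROOM_PCT))
```

With these inputs the fleet shrinks from 625 servers to 125 and then to 63, a tenfold reduction in machines to operate and monitor.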
7. Tools and Techniques for Enhancing Scalability
System activity reporting tool (wsar) that records OS, hardware, BEAM, and process metrics.
Hardware performance counters (pmcstat) to measure emulator CPU usage.
DTrace, kernel lock counters, fprof for debugging.
Various measurements and synthetic workloads to emulate production traffic.
Hot‑loading of Erlang code to apply changes without downtime.
Patch‑based enhancements to BEAM, Mnesia, and the network stack.
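The measurement-first approach can be sketched as a sampler that records periodic snapshots and flags any queue whose backlog exceeds a threshold, in the spirit of the queue-length alerts mentioned earlier. This is not wsar, whose implementation is not public; `Sampler`, `Snapshot`, the process names, and the threshold value are all hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    timestamp: float
    queue_lengths: dict  # process name -> pending messages


@dataclass
class Sampler:
    threshold: int
    history: list = field(default_factory=list)

    def record(self, queue_lengths: dict) -> list:
        """Store a snapshot and return the names of queues over the threshold."""
        self.history.append(Snapshot(time.time(), dict(queue_lengths)))
        return sorted(name for name, length in queue_lengths.items()
                      if length > self.threshold)


sampler = Sampler(threshold=1_000)
# Hypothetical readings; a real collector would poll the VM for these values.
alerts = sampler.record({"router": 12, "offline_store": 4_500, "push": 1_000})
print(alerts)
```

Keeping the full history, rather than only the latest reading, is what makes it possible to compare behaviour before and after a tuning change.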
8. Lessons Learned
Optimization is arduous and requires continuous tooling, testing, and data‑driven iteration.
Accurate measurement and bottleneck elimination are essential for scaling.
Erlang proves to be a robust, high‑performance platform despite the need for extensive tuning.
Keeping the system simple, avoiding ads, and focusing on user privacy contributed to rapid adoption.
Identity tied to phone numbers simplifies design but imposes constraints.
Deliberately overprovisioned hardware keeps the service available through holiday traffic peaks, so staff are not tied up fixing overload problems during their vacations.