How WhatsApp Scales to 450 Million Users with Just 32 Engineers
This article examines WhatsApp's high‑reliability architecture, detailing how a tiny team of 32 engineers leverages Erlang, FreeBSD, and custom BEAM patches to support hundreds of nodes, thousands of cores, and hundreds of terabytes of memory for over 450 million active users.
Background
WhatsApp was sold to Facebook for $19 billion, yet the engineering team serving its 450 million active users has only 32 engineers. HighScalability founder Todd Hoff analyzed the reasons behind the acquisition and WhatsApp's high‑reliability architecture.
Key Statistics
450 million active users (reached faster than any other company to date)
32 engineers, each supporting ~14 million users
50 billion messages daily across 7 platforms
Hundreds of nodes, >8,000 cores, hundreds of TB RAM
>70 million Erlang messages per second
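The ratios behind these figures can be checked with a few lines of arithmetic. The sketch below only divides the numbers listed above; since the core count and the Erlang message rate are both given as lower bounds, the per-core results are rough orders of magnitude, not measurements.

```python
# Back-of-the-envelope ratios from the figures listed above.
ACTIVE_USERS = 450_000_000
ENGINEERS = 32
CORES = 8_000                      # listed as "more than", so a lower bound
ERLANG_MSGS_PER_SEC = 70_000_000   # listed as "more than", so a lower bound

users_per_engineer = ACTIVE_USERS / ENGINEERS
users_per_core = ACTIVE_USERS / CORES
erlang_msgs_per_core = ERLANG_MSGS_PER_SEC / CORES

print(f"users per engineer: {users_per_engineer / 1e6:.1f} million")
print(f"users per core: {users_per_core:,.0f}")
print(f"Erlang messages per core per second: {erlang_msgs_per_core:,.0f}")
```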
Platform Stack
Backend: Erlang on FreeBSD, using Yaws/lighttpd, PHP, custom BEAM patches, and a proprietary XMPP‑like protocol.
Clients: iPhone, Android, BlackBerry, Nokia Symbian, Nokia S40, Windows Phone, and others.
Data storage relies on SQLite on the client side.
Reliability Practices
The server side is implemented almost entirely in Erlang.
The initial server implementation was based on ejabberd and was later heavily modified.
System health is monitored via message‑queue lengths; alerts trigger when thresholds are exceeded.
Multimedia messages are uploaded to an HTTP server; the recipient receives a URL to the content along with a Base64‑encoded thumbnail.
Erlang’s hot‑code loading enables rapid feature deployment without restarts.
Clients connect over persistent SSL sockets; messages are queued on the server until the client reconnects.
Registration uses phone numbers and a PIN‑based verification flow.
Android clients use Google Push Service.
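The queue-length monitoring practice listed above can be illustrated with a minimal sketch. This is not WhatsApp's code (their servers are written in Erlang); it only shows the idea of comparing each process's backlog against a threshold and raising an alert when it is exceeded. The process names, the thresholds, and the `check_queues` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    process: str
    queue_length: int
    threshold: int


def check_queues(queue_lengths, thresholds, default_threshold=1000):
    """Return an Alert for every process whose backlog exceeds its threshold."""
    alerts = []
    for process, length in sorted(queue_lengths.items()):
        # Fall back to a default limit for processes without a specific one.
        limit = thresholds.get(process, default_threshold)
        if length > limit:
            alerts.append(Alert(process, length, limit))
    return alerts


# Hypothetical snapshot of message-queue lengths (illustrative values only).
sample = {"chat_router": 120, "offline_store": 45_000, "media_relay": 900}
limits = {"offline_store": 10_000}

alerts = check_queues(sample, limits)
for alert in alerts:
    print(f"ALERT {alert.process}: {alert.queue_length} > {alert.threshold}")
```

A growing queue means a process is receiving messages faster than it can handle them, which is why backlog length works as an early health signal.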
Scaling to 2 Million Connections per Server
Initial load was 200k concurrent connections per server, with plans to increase capacity for global events. Dynamic capacity planning, hardware redundancy, and decoupled components were employed to handle traffic spikes such as major sports events or natural disasters.
Through iterative tuning (optimizing BEAM scheduling, reducing lock contention, and applying custom kernel patches), the team raised per‑server capacity to 2 million connections, later peaking at 2.8 million.
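To see what the per-server gains mean for fleet size, the sketch below computes how many connection servers a given load would need at each capacity level mentioned above. The peak of 450 million simultaneous connections is an assumption chosen for illustration (every active user online at once); the source does not give an actual concurrent peak, and the calculation ignores redundancy headroom.

```python
def servers_needed(connections: int, per_server: int) -> int:
    """Smallest number of servers that can hold `connections` (integer ceiling)."""
    return -(-connections // per_server)


# Assumption: every active user connected at once (illustrative upper bound).
PEAK_CONNECTIONS = 450_000_000

capacities = {
    "initial (200k per server)": 200_000,
    "tuned (2 million per server)": 2_000_000,
    "peak observed (2.8 million per server)": 2_800_000,
}

fleet = {label: servers_needed(PEAK_CONNECTIONS, cap) for label, cap in capacities.items()}
for label, count in fleet.items():
    print(f"{label}: {count} servers")
```

Under that assumption, a tenfold gain in per-server capacity shrinks the connection tier by the same factor, which is what makes operating it with a small team plausible.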
Performance Tools and Techniques
System activity reporting tool (wsar) collects OS, hardware, and BEAM metrics.
Hardware performance counters (pmcstat) monitor CPU time spent in the emulator.
DTrace, kernel lock counters, and fprof for debugging.
Custom BEAM patches for detailed scheduling and lock statistics.
Memory optimizations and Mseg allocator improvements.
Key Findings
Erlang + BEAM with custom patches provides near‑linear SMP scalability; lock contention is the primary bottleneck, mitigated by code fixes and scheduler tuning. Long‑lived idle connections consume minimal resources, allowing servers to handle massive numbers of simultaneous users.
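One way to see why lock contention matters so much for SMP scalability is Amdahl's law, which is brought in here as an illustration and is not part of the source analysis. If a fraction of the work is serialized (for example, behind a contended lock), adding cores yields less and less. The 24-core count and the serial fractions below are hypothetical.

```python
def amdahl_speedup(cores: int, serial_fraction: float) -> float:
    """Ideal speedup on `cores` cores when `serial_fraction` of the work cannot run in parallel."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


CORES = 24  # hypothetical core count for one server

for serial_fraction in (0.0, 0.01, 0.05):
    speedup = amdahl_speedup(CORES, serial_fraction)
    print(f"serialized {serial_fraction:.0%}: speedup {speedup:.1f}x on {CORES} cores")
```

Even a few percent of serialized work pulls the speedup well below the core count, which is consistent with the finding that removing contended locks was key to near‑linear scaling.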
Takeaways
Scaling a messaging service to hundreds of millions of users requires relentless measurement, bottleneck elimination, and iterative testing. Erlang proves to be a powerful platform for building reliable, high‑performance back‑ends, though it demands extensive tuning and custom engineering.
ITFLY8 Architecture Home
