How WhatsApp Scaled to 450 Million Users with Erlang: Architecture and Lessons
This article dissects WhatsApp’s high‑reliability architecture at 450 million users, detailing its Erlang‑based backend, hardware choices, scaling techniques, monitoring tools, and the engineering lessons learned from pushing a single server to two million concurrent connections.
Overview
WhatsApp’s complete system design is not publicly documented, but a collection of talks, interviews, and open‑source artifacts reveals how the service achieved massive scale with a surprisingly small infrastructure.
Key Statistics
450 million active users, a milestone reached faster than any company before it.
32 engineers, each supporting roughly 14 million active users.
Roughly 50 billion messages exchanged daily across seven platforms.
Over one million new registrations per day.
Zero advertising spend and $8 million of initial investment.
Hundreds of servers, thousands of cores, and hundreds of terabytes of RAM.
Peak Erlang message rate exceeds 70 million messages per second.
In 2011 a single server handled 1 million TCP sessions; by 2012 it handled 2 million.
Platform Stack
Backend
Erlang VM (BEAM) with custom patches.
FreeBSD operating system.
Web servers: Yaws, lighttpd.
PHP for auxiliary services.
Custom XMPP implementation.
Frontend
Clients on iPhone, Android, BlackBerry, Nokia Symbian S60, Nokia S40, Windows Phone, and one further unnamed platform.
SQLite for local storage.
Hardware Profile
Dual Westmere hex-core servers (24 logical CPUs with hyper-threading).
100 GB RAM with SSD storage.
Dual NICs for public and private networks.
Product Focus
Messaging-centric: Simple, low-cost text and media exchange without geographic constraints.
Privacy-first: No server-side storage of chat history; authentication originally tied to the phone's IMEI and later to a PIN-based system.
General Technical Observations
Entire server side is written in Erlang, handling message routing and queuing.
Initial implementation built on the open‑source ejabberd XMPP server, later heavily rewritten and patched for performance.
Reliability is prioritized; monetization was a distant concern.
System health is monitored via per-node message-queue lengths, with alerts triggered when thresholds are crossed; a minimal sketch of such a check appears after this list.
Multimedia messages are uploaded to an HTTP server; the message delivered to the recipient contains a link to that content plus a Base64-encoded thumbnail.
Erlang’s hot‑code loading enables frequent updates without restarts, keeping the service loosely coupled.
Clients connect over SSL sockets; messages for offline recipients are queued on the server until the client reconnects and retrieves them, and successful delivery causes the server to delete the stored copy (a store-and-forward sketch follows this list).
Registration originally used the device's IMEI; later versions send a 5-digit PIN via SMS and derive a unique authentication key from it (see the sketch after this list).
Android clients use Google Cloud Messaging for push notifications.
Developers found Android easier for rapid prototyping compared with iOS.
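The queue-length health check mentioned above is easy to approximate in a few lines of Erlang. The sketch below assumes a module name (queue_watch) and a caller-supplied threshold; the production tooling is naturally richer.

```erlang
%% Minimal sketch of a per-node queue-length check: sum the message-queue
%% lengths of all local processes and warn when a caller-supplied threshold
%% is crossed. Module name and threshold handling are illustrative.
-module(queue_watch).
-export([check/1]).

check(Threshold) ->
    Lens = [N || P <- erlang:processes(),
                 {message_queue_len, N} <- [erlang:process_info(P, message_queue_len)]],
    Total = lists:sum(Lens),
    Worst = lists:max([0 | Lens]),
    case Total > Threshold of
        true  -> error_logger:warning_msg("node ~p: ~p queued messages (worst process ~p)~n",
                                          [node(), Total, Worst]);
        false -> ok
    end.
```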
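The store-and-forward behaviour can be sketched with a single ETS table. The offline_msgs table, the Send fun, and the module name are assumptions for illustration; the real routing layer is far more elaborate.

```erlang
%% Minimal store-and-forward sketch: queue messages for offline users and
%% delete each stored copy only after it has actually been delivered.
-module(offline_store).
-export([init/0, enqueue/2, flush/2]).

init() ->
    %% duplicate_bag keeps every pending message for a given user id.
    ets:new(offline_msgs, [named_table, duplicate_bag, public]).

%% Called by the router when the recipient has no live connection.
enqueue(UserId, Msg) ->
    ets:insert(offline_msgs, {UserId, Msg}).

%% Called when the user reconnects. Send is a fun that writes one message to
%% the client's socket and returns ok only on confirmed delivery.
flush(UserId, Send) ->
    Pending   = [Msg || {_, Msg} <- ets:lookup(offline_msgs, UserId)],
    Delivered = [Msg || Msg <- Pending, Send(Msg) =:= ok],
    %% Drop only what was actually delivered; anything else stays queued.
    lists:foreach(fun(Msg) -> ets:delete_object(offline_msgs, {UserId, Msg}) end, Delivered),
    {delivered, length(Delivered)}.
```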
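The PIN flow can be sketched in the same spirit. The SMS hand-off is stubbed and the key derivation is assumed; only the overall shape (5-digit code out, key returned on a correct reply) follows the description above.

```erlang
%% Illustrative PIN-based registration: issue a 5-digit code over SMS and,
%% on a correct reply, derive a key the client presents on later connections.
-module(pin_reg).
-export([start/1, verify/3]).

start(PhoneNumber) ->
    Pin = 9999 + rand:uniform(90000),             %% always five digits (10000..99999)
    ok = send_sms(PhoneNumber, integer_to_list(Pin)),
    {pending, Pin}.                               %% held server-side until the user replies

verify(PhoneNumber, {pending, Pin}, Pin) ->
    %% Assumed derivation: hash the number, the PIN, and fresh randomness.
    Key = crypto:hash(sha256, [PhoneNumber, integer_to_list(Pin),
                               crypto:strong_rand_bytes(16)]),
    {ok, Key};
verify(_PhoneNumber, _Pending, _SubmittedPin) ->
    {error, wrong_pin}.

send_sms(_PhoneNumber, _Text) -> ok.              %% placeholder for the SMS gateway call
```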
Scaling a Single Server to Two‑Million Connections
Growth required more hardware and introduced operational complexity.
Capacity planning had to account for spikes from events such as football matches or earthquakes.
Initial target was 200 k concurrent connections per server; the ultimate goal was one million per server.
Tools and Techniques for Enhancing Scalability
System Activity Reporter (wsar): Collects OS, hardware, and BEAM metrics at sub-second intervals during load spikes.
CPU hardware counters (pmcstat): Measure where emulator cycles go, identifying where optimization effort yields the most benefit.
DTrace, kernel lock counters, fprof: Debugging and profiling utilities; DTrace is used mainly for debugging rather than performance tuning.
Problem areas discovered: Excessive garbage-collection time, network-stack bottlenecks, and scheduler lock contention.
Metrics methodology: Synthetic workloads generated by test scripts; real traffic replayed in isolated environments; throttling of GC during queue buildup; and monitoring of packet rates, CPU, VM, and scheduler utilization (a BEAM-level sampling sketch follows this list).
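As a rough illustration of the BEAM-level side of this methodology, the sketch below samples scheduler utilization and garbage-collection counters at a fixed interval. The module name and output format are assumptions; wsar itself is not public.

```erlang
%% Minimal metrics poller: enable scheduler wall-time accounting, then report
%% per-scheduler utilization and GC activity once per interval.
-module(beam_sampler).
-export([start/1]).

start(IntervalMs) ->
    erlang:system_flag(scheduler_wall_time, true),
    spawn(fun() -> loop(IntervalMs, erlang:statistics(scheduler_wall_time)) end).

loop(IntervalMs, Prev) ->
    timer:sleep(IntervalMs),
    Curr = erlang:statistics(scheduler_wall_time),
    %% Utilization per scheduler = delta active time / delta total time.
    Util = [{Id, (A1 - A0) / max(T1 - T0, 1)}
            || {{Id, A0, T0}, {Id, A1, T1}} <- lists:zip(lists:sort(Prev), lists:sort(Curr))],
    {GCs, WordsReclaimed, _} = erlang:statistics(garbage_collection),
    io:format("schedulers ~p  gcs ~p  words_reclaimed ~p~n", [Util, GCs, WordsReclaimed]),
    loop(IntervalMs, Curr).
```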
Results of the Scaling Experiments
Initial baseline: 200 k concurrent connections per server.
First bottleneck appeared around 425 k connections; CPU usage plateaued at 35‑45 % while the scheduler consumed 95 %.
After the first round of fixes, connections exceeded 1 million.
VM utilization stabilized at ~76 % and CPU at ~73 %.
One month later each server handled 2 million connections, with BEAM utilization at 80 %.
Peak observed: 2.8 million connections per server, 571 k packets/sec, >200 k distributed messages/sec.
Further attempts to reach 3 million connections were unsuccessful.
Key Findings
Erlang/BEAM with custom patches provides near‑linear SMP scalability on 24‑core machines.
Long-lived idle connections consume minimal CPU, which is what makes multi-million connection counts per server feasible.
Scheduler lock contention is the primary performance limiter; reducing lock usage and partitioning load more evenly across schedulers helps dramatically.
Optimizing hash tables and the mseg_alloc memory allocator, and reducing port-related syscalls, improves throughput (illustrative VM flags appear after this list).
Hot-code loading allows rapid iteration without downtime (a gen_server sketch follows this list).
Instrumentation added negligible overhead and proved essential for production‑grade monitoring.
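For context on what that VM tuning can look like in practice, below is an illustrative vm.args-style flag set. Every switch is a documented erl option, but the specific values are assumptions, not WhatsApp's actual production settings.

```
# Kernel poll: cheap handling of very large numbers of mostly idle sockets.
+K true

# Raise process and port limits well above the per-server connection target.
+P 2000000
+Q 1000000

# One scheduler per logical CPU on the 24-thread boxes, bound to cores to
# reduce migration and lock ping-pong.
+S 24:24
+sbt db

# Let mseg_alloc keep more memory segments cached for reuse.
+MMmcs 30

# Larger distribution buffer busy limit before senders to remote nodes are suspended.
+zdbbl 32768
```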
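The hot-code-loading point is worth a concrete shape. The sketch below is a bare gen_server with a code_change/3 callback, under an assumed module name; loading a new version (for example with c:l(session). in a shell, or via a release upgrade) swaps the code while the process, and the connection it owns, keeps running.

```erlang
%% Bare connection-owning gen_server illustrating hot code loading: a new
%% module version can be loaded while the socket stays up, and code_change/3
%% migrates any state whose layout changed between versions.
-module(session).
-behaviour(gen_server).
-export([start_link/1, init/1, handle_call/3, handle_cast/2,
         handle_info/2, terminate/2, code_change/3]).

-record(state, {socket, user_id}).

start_link(Socket) -> gen_server:start_link(?MODULE, Socket, []).

init(Socket) -> {ok, #state{socket = Socket}}.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State)            -> {noreply, State}.
handle_info(_Info, State)           -> {noreply, State}.
terminate(_Reason, _State)          -> ok.

%% Called during an upgrade; the connection is never dropped.
code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```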
Experience Summary
Optimization is labor‑intensive; engineers must continuously write tools, add patches, and fine‑tune knobs.
Data collection, metric‑driven debugging, and iterative testing are the backbone of scaling work.
Erlang’s strengths in reliability and concurrency outweigh the cost of extensive VM customizations.