Industry Insights 19 min read

How WhatsApp Scaled to 450 Million Users with Erlang: Architecture and Lessons

This article dissects WhatsApp’s high‑reliability architecture that supports 450 million users, detailing its Erlang‑based backend, hardware choices, scaling techniques, monitoring tools, and the engineering lessons learned from pushing a single server to two‑million concurrent connections.

Art of Distributed System Architecture Design
Art of Distributed System Architecture Design
Art of Distributed System Architecture Design
How WhatsApp Scaled to 450 Million Users with Erlang: Architecture and Lessons

Overview

WhatsApp’s complete system design is not publicly documented, but a collection of talks, interviews, and open‑source artifacts reveals how the service achieved massive scale with a surprisingly small infrastructure.

Key Statistics

450 million active users, reached faster than any prior company.

32 engineers, each supporting roughly 14 million active users.

500 billion messages exchanged daily across seven platforms.

Over one million new registrations per day.

Zero advertising spend and $8 million of initial investment.

Hundreds of servers, thousands of cores, and hundreds of terabytes of RAM.

Peak Erlang message rate exceeds 70 million messages per second.

In 2011 a single server handled 1 million TCP sessions; by 2012 it handled 2 million.

Platform Stack

Backend

Erlang VM (BEAM) with custom patches.

FreeBSD operating system.

Web servers: Yaws, lighttpd.

PHP for auxiliary services.

Custom XMPP implementation.

Frontend

Clients on iPhone, Android, BlackBerry, Nokia Symbian 360, Nokia S40, Windows Phone, and an unknown platform.

SQLite for local storage.

Hardware Profile

Dual Westmere hex‑core servers (24 logical CPUs).

100 GB RAM with SSD storage.

Dual NICs for public and private networks.

Product Focus

Messaging‑centric : Simple, low‑cost text and media exchange without geographic constraints.

Privacy‑first : No server‑side storage of chat history; authentication originally tied to phone IMEI and later to a PIN‑based system.

General Technical Observations

Entire server side is written in Erlang, handling message routing and queuing.

Initial implementation built on the open‑source ejabberd XMPP server, later heavily rewritten and patched for performance.

Reliability is prioritized; monetization was a distant concern.

System health is monitored via per‑node message‑queue lengths; alerts trigger when thresholds are crossed.

Multimedia messages are uploaded to an HTTP server; the message contains a link and a Base64 thumbnail.

Erlang’s hot‑code loading enables frequent updates without restarts, keeping the service loosely coupled.

SSL sockets queue messages on the server until the client reconnects and retrieves them; successful delivery causes the server to delete the stored copy.

Registration originally used the device’s IMEI; later a 5‑digit PIN sent via SMS creates a unique key for authentication.

Android clients use Google Cloud Messaging for push notifications.

Developers found Android easier for rapid prototyping compared with iOS.

Scaling a Single Server to Two‑Million Connections

Growth required more hardware and introduced operational complexity.

Capacity planning had to account for spikes from events such as football matches or earthquakes.

Initial target was 200 k concurrent connections per server; the ultimate goal was one million per server.

Tools and Techniques for Enhancing Scalability

System Activity Reporter (wsar) : Collects OS, hardware, and BEAM metrics at sub‑second intervals during load spikes.

CPU hardware counters (pmcstat) : Measure emulator cycle time to identify where optimization effort yields the most benefit.

DTrace, kernel lock counters, fprof : Debugging and profiling utilities; DTrace is used mainly for debugging, not performance tuning.

Problem areas discovered : Excessive garbage‑collection time, network‑stack bottlenecks, scheduler lock contention.

Metrics methodology : Synthetic workloads generated by test scripts; real‑traffic replayed in isolated environments; throttling of GC during queue buildup; monitoring of packet rates, CPU, VM, and scheduler utilization.

Results of the Scaling Experiments

Initial baseline: 200 k concurrent connections per server.

First bottleneck appeared around 425 k connections; CPU usage plateaued at 35‑45 % while the scheduler consumed 95 %.

After the first round of fixes, connections exceeded 1 million.

VM utilization stabilized at ~76 % and CPU at ~73 %.

One month later each server handled 2 million connections, with BEAM utilization at 80 %.

Peak observed: 2.8 million connections per server, 571 k packets/sec, >200 k distributed messages/sec.

Further attempts to reach 3 million connections were unsuccessful.

Key Findings

Erlang/BEAM with custom patches provides near‑linear SMP scalability on 24‑core machines.

Long‑lived idle connections consume minimal CPU but increase overall connection count.

Scheduler lock contention is the primary performance limiter; reducing lock usage and improving partitioned load dramatically helps.

Optimizing hash tables, Mseg allocator, and reducing port‑related syscalls improves throughput.

Hot‑code loading allows rapid iteration without downtime.

Instrumentation added negligible overhead and proved essential for production‑grade monitoring.

Experience Summary

Optimization is labor‑intensive; engineers must continuously write tools, add patches, and fine‑tune knobs.

Data collection, metric‑driven debugging, and iterative testing are the backbone of scaling work.

Erlang’s strengths in reliability and concurrency outweigh the cost of extensive VM customizations.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

backend architectureoperationsScalabilityErlangWhatsApp
Art of Distributed System Architecture Design
Written by

Art of Distributed System Architecture Design

Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.