How WhatsApp Scales to 450 Million Users with Just 32 Engineers

This article examines WhatsApp's high‑reliability architecture, detailing how a tiny team of 32 engineers leverages Erlang, FreeBSD, and custom BEAM patches to support hundreds of nodes, thousands of cores, and hundreds of terabytes of memory for over 450 million active users.

ITFLY8 Architecture Home

Apr 12, 2018

Background

WhatsApp was sold to Facebook for $19 billion, yet the engineering team serving its 450 million active users consists of only 32 engineers. HighScalability founder Todd Hoff analyzed the reasons behind the acquisition and WhatsApp's high‑reliability architecture.

Key Statistics

450 million active users, a milestone reached faster than by any other company to date

32 engineers, each supporting ~14 million users

50 billion messages daily across 7 platforms

Hundreds of nodes, more than 8,000 cores, hundreds of terabytes of RAM

>70 million Erlang messages per second
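The ratios in this list follow directly from the raw figures. As a quick check, the minimal Python sketch below derives the per‑engineer load and the daily Erlang message volume implied by the quoted lower bound of 70 million per second. The constant and function names are illustrative and do not come from the source.

```python
# Figures quoted in the Key Statistics list.
ACTIVE_USERS = 450_000_000
ENGINEERS = 32
ERLANG_MSGS_PER_SEC = 70_000_000  # quoted as a lower bound (">70 million")

SECONDS_PER_DAY = 24 * 60 * 60


def users_per_engineer(users: int, engineers: int) -> float:
    """Average number of active users supported by one engineer."""
    return users / engineers


def messages_per_day(msgs_per_sec: int) -> int:
    """Daily volume implied by a sustained per-second rate."""
    return msgs_per_sec * SECONDS_PER_DAY


if __name__ == "__main__":
    per_engineer = users_per_engineer(ACTIVE_USERS, ENGINEERS)
    print(f"Users per engineer: {per_engineer / 1e6:.2f} million")
    daily = messages_per_day(ERLANG_MSGS_PER_SEC)
    print(f"Erlang messages per day (lower bound): {daily:,}")
```

The Erlang figure presumably counts messages passed between Erlang processes inside the cluster rather than chat messages sent by users, which would explain why it is so much larger than the daily chat volume.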

Platform Stack

Backend: Erlang on FreeBSD, using Yaws/lighttpd, PHP, custom BEAM patches, and a proprietary XMPP‑like protocol.

Clients: iPhone, Android, BlackBerry, Nokia Symbian, Nokia S40, Windows Phone, and others.

Data storage relies on SQLite on the client side.

Reliability Practices

The server side is implemented almost entirely in Erlang.

The initial server was built on ejabberd, which was later heavily modified.

System health is monitored via message‑queue lengths; alerts trigger when thresholds are exceeded.

Multimedia messages are uploaded to an HTTP server; the recipient receives a link to the content together with a Base64‑encoded thumbnail.

Erlang’s hot‑code loading enables rapid feature deployment without restarts.

Messages are sent over SSL sockets to the server pools and are queued on the server until the client reconnects to retrieve them.

Registration uses phone numbers and a PIN‑based verification flow.

Android clients rely on Google's push service for notifications.
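The queue‑length monitoring mentioned in this list can be illustrated with a small threshold check. In Erlang, a process's backlog is available through `erlang:process_info(Pid, message_queue_len)`; the Python sketch below models only the alerting logic, applied to numbers that have already been collected. The process names, the threshold value, and the `QueueAlert` type are hypothetical and are not taken from WhatsApp's tooling.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class QueueAlert:
    """One process whose message backlog exceeded the threshold."""
    process: str
    length: int
    threshold: int


def check_queue_lengths(queue_lengths: Dict[str, int], threshold: int) -> List[QueueAlert]:
    """Return alerts for every process whose backlog is above the threshold.

    Alerts are sorted with the longest queue first so the worst offender
    is the first thing an operator sees.
    """
    alerts = [
        QueueAlert(name, length, threshold)
        for name, length in queue_lengths.items()
        if length > threshold
    ]
    return sorted(alerts, key=lambda alert: alert.length, reverse=True)


if __name__ == "__main__":
    # Hypothetical snapshot of per-process queue lengths.
    snapshot = {"session_router": 12, "offline_store": 48_000, "media_uploader": 950}
    for alert in check_queue_lengths(snapshot, threshold=1_000):
        print(f"ALERT {alert.process}: {alert.length} queued (limit {alert.threshold})")
```

A growing queue means a process is receiving work faster than it can handle it, so a single number per process works as an early warning before users notice latency.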

Scaling to 2 Million Connections per Server

Each server initially handled about 200,000 concurrent connections, and the team planned to raise that capacity ahead of global events. Dynamic capacity planning, hardware redundancy, and decoupled components were used to absorb traffic spikes such as major sports events or natural disasters.

Through iterative tuning (optimizing BEAM scheduling, reducing lock contention, and applying custom kernel patches), the team raised per‑server capacity to 2 million connections, later peaking at 2.8 million.
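To see why per‑server capacity matters so much for a small team, consider a rough fleet‑sizing calculation. The sketch below is an assumption‑laden illustration, not WhatsApp's planning method: the 100 million peak connections and the 30% headroom reserved for spikes are hypothetical inputs, and `servers_needed` is an invented helper. Only the 200,000 and 2 million per‑server figures come from the text.

```python
def servers_needed(peak_connections: int, per_server_capacity: int,
                   headroom_pct: int = 30, spares: int = 0) -> int:
    """Estimate how many servers are needed for a given peak load.

    headroom_pct is the share of each server's capacity kept free for
    traffic spikes; spares are extra machines held for hardware failures.
    Integer arithmetic is used throughout to avoid rounding surprises.
    """
    if per_server_capacity <= 0:
        raise ValueError("per_server_capacity must be positive")
    if not 0 <= headroom_pct < 100:
        raise ValueError("headroom_pct must be in the range 0..99")
    usable = per_server_capacity * (100 - headroom_pct) // 100
    if usable == 0:
        raise ValueError("no usable capacity left after headroom")
    # Ceiling division: a partially filled server still counts as one server.
    return -(-peak_connections // usable) + spares


if __name__ == "__main__":
    peak = 100_000_000  # hypothetical peak of concurrent connections
    for capacity in (200_000, 2_000_000):
        count = servers_needed(peak, capacity)
        print(f"{capacity:>9,} connections/server -> {count} servers")
```

Under these assumptions, the tenfold increase in per‑server capacity shrinks the fleet by roughly the same factor, which means fewer machines to deploy, monitor, and repair.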

Performance Tools and Techniques

System activity reporting tool (wsar) collects OS, hardware, and BEAM metrics.

Hardware performance counters (pmcstat) monitor CPU time spent in the emulator.

DTrace, kernel lock counters, and fprof for debugging.

Custom BEAM patches for detailed scheduling and lock statistics.

Memory optimizations and Mseg allocator improvements.

Key Findings

Erlang + BEAM with custom patches provides near‑linear SMP scalability; lock contention is the primary bottleneck, mitigated by code fixes and scheduler tuning. Long‑lived idle connections consume minimal resources, allowing servers to handle massive numbers of simultaneous users.

Takeaways

Scaling a messaging service to hundreds of millions of users requires relentless measurement, bottleneck elimination, and iterative testing. Erlang proves to be a powerful platform for building reliable, high‑performance back‑ends, though it demands extensive tuning and custom engineering.
