How WhatsApp Scaled to 40 Billion Daily Messages with a 10‑Engineer Team

This article examines how WhatsApp grew its user base, message volume, and infrastructure dramatically over two years while keeping the engineering team at ten, detailing the architectural choices, Erlang‑based stack, hardware scaling, decoupling strategies, and the operational challenges that led to both performance gains and a major outage.

Art of Distributed System Architecture Design
Overview of WhatsApp’s Two‑Year Growth

WhatsApp’s scale today is incomparable to two years ago. Despite handling far more users, messages, and connections, the engineering team has remained at roughly ten engineers, each responsible for about 40 million users. This is made possible by off‑loading networking, hardware, and data‑center operations to cloud services.

Key Statistics

465 million monthly active users

19 billion messages received daily, 40 billion sent

600 million images, 200 million voice clips, 100 million video clips

Peak concurrent connections: 147 million

Peak login operations per second: 230 k

Peak inbound/outbound message rates: 324 k / 712 k per second

~10 engineers responsible for development and operations

Holiday‑season peaks: 146 Gb/s outbound traffic on Christmas Eve, 360 million video downloads, 2 billion image downloads on New Year’s Eve, a single image downloaded 32 million times

Technology Stack

Erlang R16B01 (with custom patches)

FreeBSD 9.2

Mnesia database

Yaws web server

SoftLayer cloud and dedicated servers

Hardware Deployment

~550 servers (including spares)

~150 chat servers, each handling ~1 million phones, for roughly 150 million peak connections in total

~250 multimedia servers

2 × 2690v2 Ivy Bridge 10‑core CPUs (40 hyper‑threads total)

Database nodes with 512 GB RAM

Standard compute nodes with 64 GB RAM

SSDs for reliability and video storage when needed

Dual‑link GigE (public user‑facing and private backend)

Over 11 000 Erlang VM cores in production
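The deployment figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in this article; the assumption that every server carries the dual 10-core CPU configuration is an illustrative simplification, not something the source states.

```python
# Back-of-the-envelope check of the deployment figures quoted above.
# Assumption (not from the source): every server uses the dual 10-core setup.

SERVERS = 550                    # total servers, including spares
CORES_PER_SERVER = 2 * 10        # two 10-core CPUs, physical cores only
CHAT_SERVERS = 150
PEAK_CONNECTIONS = 147_000_000   # peak concurrent connections

total_cores = SERVERS * CORES_PER_SERVER
connections_per_chat_server = PEAK_CONNECTIONS // CHAT_SERVERS

print(f"physical cores: {total_cores}")
print(f"peak connections per chat server: {connections_per_chat_server}")
```

Both results are in the same range as the quoted core count and the figure of about one million phones per chat server.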

System Architecture

The system is built around Erlang’s strengths: excellent SMP scalability, fast code upgrades, and a lightweight process model. Messages flow from mobile clients to an MMS (multimedia) layer, then to chat services, which interact with various databases (Account, Profile, Push, Group, etc.). Only active messages and media are stored. Mnesia holds roughly 2 TB of data in RAM, spread across 16 shards and totaling about 18 billion records.

Decoupling and Bottleneck Isolation

Isolate bottlenecks to prevent cascading failures.

Separate front‑end and back‑end services.

Use asynchronous processing to keep throughput high.

Employ FIFO queues for unpredictable latency scenarios.

Partition read/write queues to avoid write‑induced read stalls.

Distribute messages across lightweight Erlang processes; if a node fails, only its messages are affected.
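The front-end/back-end separation and FIFO queueing described above can be sketched in a few lines. This is a minimal Python illustration of the idea, not WhatsApp's Erlang implementation: the function names and the 1 ms delay are hypothetical, and a thread with a queue stands in for an Erlang process and its mailbox.

```python
import queue
import threading
import time


def backend_worker(inbox: queue.Queue, stored: list) -> None:
    """Drain the FIFO queue in arrival order; None is the shutdown signal."""
    while True:
        message = inbox.get()
        if message is None:
            break
        time.sleep(0.001)  # stand-in for a slow, unpredictable back-end write
        stored.append(message)


def run_frontend(messages: list) -> tuple:
    """Enqueue every message without waiting for the back end to finish."""
    inbox = queue.Queue()
    stored = []
    worker = threading.Thread(target=backend_worker, args=(inbox, stored))
    worker.start()

    start = time.perf_counter()
    for message in messages:
        inbox.put(message)  # returns immediately; the front end never blocks
    enqueue_seconds = time.perf_counter() - start

    inbox.put(None)  # ask the worker to stop once the backlog is drained
    worker.join()
    return stored, enqueue_seconds


stored, enqueue_seconds = run_frontend([f"msg-{i}" for i in range(50)])
print(len(stored), stored[:3])
```

The front end finishes enqueueing long before the back end finishes writing, yet the messages are still processed in arrival order, which is the property the FIFO bullet relies on.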

Parallelism and Task Distribution

Distribute work across >11 000 cores using a gen_server → gen_factory → gen_industry pipeline.

Introduce a gen_industry layer to parallelize input handling and immediate worker assignment.

Shard services into 2–32 groups, using pg2 for distributed process groups and node‑level master‑slave setups for disaster recovery.

Limit the number of processes concurrently accessing a single ets table or Mnesia fragment to eight to control lock contention.
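Two of the ideas above, hashing work onto a fixed number of shards and capping concurrent access at eight, can be sketched as follows. This is a hedged Python illustration only: `shard_for`, `GuardedTable`, and the choice of CRC32 are assumptions made for the example, and a semaphore stands in for whatever admission mechanism the Erlang code actually uses.

```python
import threading
import zlib

NUM_SHARDS = 32      # upper end of the 2-32 range quoted above
MAX_CONCURRENT = 8   # cap on simultaneous accessors


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard with a stable hash (CRC32 is an arbitrary choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class GuardedTable:
    """A dict that admits at most MAX_CONCURRENT writers at a time."""

    def __init__(self, limit: int = MAX_CONCURRENT) -> None:
        self._slots = threading.BoundedSemaphore(limit)
        self._stats_lock = threading.Lock()
        self._data_lock = threading.Lock()
        self._data = {}
        self.active = 0
        self.peak_active = 0

    def write(self, key, value) -> None:
        with self._slots:  # callers wait here once the limit is reached
            with self._stats_lock:
                self.active += 1
                self.peak_active = max(self.peak_active, self.active)
            with self._data_lock:
                self._data[key] = value
            with self._stats_lock:
                self.active -= 1

    def size(self) -> int:
        with self._data_lock:
            return len(self._data)


table = GuardedTable()
threads = [
    threading.Thread(target=table.write, args=(f"user-{i}", i))
    for i in range(64)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(table.size(), table.peak_active <= MAX_CONCURRENT)
```

All 64 writes complete, but no more than eight callers are ever inside the guarded section at once.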

Mnesia Optimizations

Prefer async_dirty operations over transactions for most workloads.

Use cast (asynchronous) instead of call (synchronous) to avoid mailbox blocking.

Patch OTP to support multiple async_dirty transaction managers, increasing parallel writes.

Split Mnesia tables into many fragments (e.g., 512 fragments for account tables) and group them into “islands” to reduce replication traffic.

Apply custom patches to improve network‑partition handling and async_dirty response times.
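The fragment-and-island layout above can be pictured as a two-step lookup: a key hashes to a fragment, and each fragment belongs to one island. The sketch below is a simplification under stated assumptions: the island count of 16 is hypothetical (the article gives none), the function names are invented for illustration, and plain modulo hashing is used here rather than Mnesia's own fragmentation hashing.

```python
import zlib

NUM_FRAGMENTS = 512   # fragment count quoted above for account tables
NUM_ISLANDS = 16      # hypothetical; the article does not give an island count


def fragment_for(key: str) -> int:
    """Pick a fragment with a stable hash (simple modulo, not Mnesia's scheme)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_FRAGMENTS


def island_for(fragment: int) -> int:
    """Assign contiguous ranges of fragments to islands."""
    fragments_per_island = NUM_FRAGMENTS // NUM_ISLANDS
    return fragment // fragments_per_island


def locate(key: str) -> tuple:
    """Return (fragment, island) for a key."""
    fragment = fragment_for(key)
    return fragment, island_for(fragment)


fragment, island = locate("account:15551234567")
print(fragment, island)
```

Because a fragment is replicated only within its island, a write touches the few nodes of that island instead of the whole cluster, which is how grouping fragments reduces replication traffic.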

Performance Optimizations

Add a write‑back cache achieving ~98 % hit rate, buffering messages before filesystem writes.

Patch the BEAM VM for asynchronous file I/O, reducing mailbox stalls under heavy load.

Exclude large mailboxes from cache to prevent a few high‑traffic groups from degrading overall performance.

Increase fragment count to lower Mnesia table access latency.

Address hash‑bucket bloat by redesigning the hash algorithm, yielding roughly a 4‑to‑1 performance improvement.
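The write-back cache bullet above is easier to follow with a toy model. In this hedged Python sketch, messages are buffered in memory and only flushed to the backing store in batches; a message that is read and deleted before the flush never touches the store at all. The class name, the flush threshold, and the dict standing in for the filesystem are all assumptions for illustration.

```python
class WriteBackCache:
    """Buffer writes in memory and flush them to a backing store in batches."""

    def __init__(self, backing_store: dict, flush_threshold: int = 4) -> None:
        self.backing_store = backing_store  # stand-in for the filesystem
        self.flush_threshold = flush_threshold
        self.dirty = {}                     # buffered, not yet persisted
        self.hits = 0
        self.misses = 0

    def write(self, key, value) -> None:
        self.dirty[key] = value
        if len(self.dirty) >= self.flush_threshold:
            self.flush()

    def read(self, key):
        if key in self.dirty:
            self.hits += 1
            return self.dirty[key]
        self.misses += 1
        return self.backing_store.get(key)

    def delete(self, key) -> None:
        """Drop a message; if it is still buffered it never reaches the store."""
        if key in self.dirty:
            del self.dirty[key]
        else:
            self.backing_store.pop(key, None)

    def flush(self) -> None:
        self.backing_store.update(self.dirty)
        self.dirty.clear()


disk = {}
cache = WriteBackCache(disk, flush_threshold=4)
for i in range(3):
    cache.write(f"msg-{i}", f"payload-{i}")

# Two messages are delivered and acknowledged before any flush happens.
delivered = [cache.read("msg-0"), cache.read("msg-1")]
cache.delete("msg-0")
cache.delete("msg-1")
cache.flush()

print(sorted(disk), cache.hits, cache.misses)
```

Only the undelivered message reaches the store. For a workload where most messages are delivered quickly, this is the effect a high hit rate produces: most messages are served from memory and never need a filesystem write.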

Patches and Fixes

Mitigate timer‑wheel lock contention by sharding timers across multiple wheels.

Patch mnesia_tm to prevent transaction backlog under high load.

Introduce multiple async_dirty senders for Mnesia.

Prefer loading Mnesia data from nearby nodes to reduce cross‑cluster traffic.

Enhance ETS tables for large‑scale usage and avoid excessive dump queues.
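The timer-sharding fix in the first bullet follows a general pattern: replace one structure behind one lock with several structures, each behind its own lock. The sketch below illustrates that pattern in Python under loud assumptions: it uses a heap per shard rather than a true timer wheel, and `ShardedTimers` and its methods are hypothetical names, not BEAM internals.

```python
import heapq
import threading
import zlib


class ShardedTimers:
    """Timers spread over several independently locked heaps."""

    def __init__(self, num_shards: int = 4) -> None:
        self._shards = [[] for _ in range(num_shards)]
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def _index(self, timer_id: str) -> int:
        return zlib.crc32(timer_id.encode("utf-8")) % len(self._shards)

    def schedule(self, timer_id: str, fire_at: float) -> None:
        i = self._index(timer_id)
        with self._locks[i]:  # only one shard is locked per operation
            heapq.heappush(self._shards[i], (fire_at, timer_id))

    def pop_due(self, now: float) -> list:
        """Remove and return every timer whose deadline has passed."""
        due = []
        for shard, lock in zip(self._shards, self._locks):
            with lock:
                while shard and shard[0][0] <= now:
                    due.append(heapq.heappop(shard))
        return sorted(due)


timers = ShardedTimers(num_shards=4)
for i in range(10):
    timers.schedule(f"t{i}", fire_at=float(i))

due = timers.pop_due(now=4.0)
print([timer_id for _, timer_id in due])
```

Two threads scheduling timers that hash to different shards no longer contend for the same lock, which is the point of the sharding.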

February 22 Outage

A 210‑minute outage occurred after Facebook’s acquisition, triggered by a routing issue that disabled a LAN segment, causing massive node disconnects and reconnections. An over‑coupled subsystem involving pg2 generated n³ messages, inflating the message queue to 4 million within seconds, prompting a critical patch.
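To see why cubic message growth is so dangerous, it helps to plug in a few cluster sizes. The node counts below are purely illustrative and do not come from the source; the function name is hypothetical.

```python
def storm_messages(nodes: int) -> int:
    """Messages generated when each reconnect event fans out as n^3."""
    return nodes ** 3


for nodes in (10, 50, 100, 160):
    print(f"{nodes:>4} nodes -> {storm_messages(nodes):>9,} messages")
```

With cubic growth, a cluster in the low hundreds of nodes is already enough to produce millions of queued messages from a single mass-reconnect event.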

Feature Release Strategy

Due to the impossibility of simulating full‑scale traffic, new features are rolled out gradually: first on low‑traffic clusters, then iteratively expanded after validation. Updates are performed as rolling restarts; hot patches are rare and complex.
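The gradual rollout described above can be sketched as a small planning function: order clusters from least to most traffic, then deploy in stages of growing size, stopping as soon as a stage fails validation. The cluster names, traffic figures, and doubling stage size are hypothetical choices for this example.

```python
def rollout_stages(cluster_traffic: dict, first_stage_size: int = 1,
                   growth: int = 2) -> list:
    """Group clusters into stages, lowest traffic first, each stage larger."""
    ordered = sorted(cluster_traffic, key=cluster_traffic.get)
    stages = []
    size = first_stage_size
    i = 0
    while i < len(ordered):
        stages.append(ordered[i:i + size])
        i += size
        size *= growth
    return stages


def deploy(stages: list, validate) -> list:
    """Deploy stage by stage; stop after the first stage that fails validation."""
    deployed = []
    for stage in stages:
        deployed.extend(stage)
        if not validate(stage):
            break
    return deployed


# Hypothetical clusters with relative traffic levels.
traffic = {"chat-a": 900, "chat-b": 120, "chat-c": 450, "chat-d": 60, "chat-e": 700}
stages = rollout_stages(traffic)
print(stages)

# Pretend validation fails on any stage that contains chat-c.
deployed = deploy(stages, validate=lambda stage: "chat-c" not in stage)
print(deployed)
```

In this run the rollout halts after the second stage, so the two highest-traffic clusters never receive the faulty build.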
