How Facebook Scales to Billions: Disaggregated Networks, Storage, and Warm Spark

Facebook’s path from early startup operations to serving more than 2 billion monthly users shows how a disaggregated network, disaggregated storage, and Spark running on Warm Storage remove scalability bottlenecks. This summary covers the operational strategies and design principles behind large, reliable data-center services.

Efficient Ops

Nov 27, 2017

Preface

Operational practices differ between a small startup, a mid-size company such as Twitter, and a company at Facebook's scale. Drawing on a decade in Silicon Valley, the author shares how to keep products running with open-source tools, cloud infrastructure, and rapid feature delivery.

Once monthly active users exceed a billion and compute demand grows by more than 50% a year, where should limited resources go, especially if scale is expected to grow ten-fold?
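To put that growth figure in perspective, a quick compound-growth calculation shows how soon ten-fold scale arrives. This is a minimal sketch: the 50% rate comes from the text, while the function name `years_to_multiply` and the assumption of a constant annual rate are illustrative only.

```python
import math


def years_to_multiply(annual_growth: float, factor: float) -> float:
    """Years of steady compound growth needed to grow by `factor`.

    annual_growth is a fraction, e.g. 0.50 for 50% per year (assumed constant).
    """
    return math.log(factor) / math.log(1.0 + annual_growth)


years = years_to_multiply(0.50, 10.0)
print(f"{years:.1f} years")  # prints "5.7 years"
```

At a steady 50% a year, ten-fold growth takes a little under six years, so an architecture chosen today has to hold up over roughly that horizon.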

1. Challenges Facebook Faces at Scale

Facebook now serves over 2 billion monthly active users, and the data they generate has shifted from text to images to video while growing exponentially. Existing technologies cannot sustain this scale, especially on big-data compute platforms, where storage and network traffic far exceed the volume of raw user data.

CPU performance no longer improves at the pace Moore's law once implied, so scalability now depends on scaling distributed architectures horizontally. This shift demands new designs for network, storage, and compute clusters.

2. Concept of Disaggregation

Disaggregation replaces custom hardware with commodity servers, separates hardware and software development cycles, and decouples compute from storage, allowing each to scale independently.

Advantages include independent upgrade cadences for software and hardware, and the ability to tier storage into cold and warm layers while provisioning compute clusters with high‑memory or high‑CPU machines as needed.

3. Disaggregated Network

Facebook’s disaggregated network, called Fabric, is a highly reliable, high-capacity core network designed to avoid bottlenecks, which is what makes compute-storage separation practical.

Previous generations used cluster switches (CSWs) with 3+1 redundancy. Scaling was limited by the capacity of those switches, and a single switch failure could affect thousands of machines.

Fabric adopts a mesh architecture of rack and spine switches, providing multiple parallel paths between any two servers. Failure of one or more nodes does not disrupt traffic, allowing seamless scaling to hundreds of thousands of machines.
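The effect of multiple parallel paths can be sketched with a toy model. The code below assumes a simplified two-tier layout in which every rack switch links to every spine switch; the switch names, the function `surviving_paths`, and the failure inputs are hypothetical, and the text does not describe the production topology at this level of detail.

```python
def surviving_paths(src_rack, dst_rack, spines, failed_spines=(), failed_links=()):
    """List the spines that still offer a src_rack -> spine -> dst_rack path.

    failed_links holds (rack, spine) pairs whose cable or port is down.
    Assumes every rack switch is wired to every spine (illustrative model).
    """
    failed_spines = set(failed_spines)
    failed_links = set(failed_links)
    return [
        spine
        for spine in spines
        if spine not in failed_spines
        and (src_rack, spine) not in failed_links
        and (dst_rack, spine) not in failed_links
    ]


spines = ["spine-1", "spine-2", "spine-3", "spine-4"]

healthy = surviving_paths("rack-a", "rack-b", spines)
degraded = surviving_paths(
    "rack-a", "rack-b", spines,
    failed_spines=["spine-2"],
    failed_links=[("rack-b", "spine-4")],
)
print(len(healthy), len(degraded))  # 4 paths when healthy, 2 after both failures
```

In this model a failure removes one path and some bandwidth, but the two racks stay connected as long as any spine remains reachable from both.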

Network expansion is achieved by adding Pods; bandwidth upgrades are done by adding uplinks, providing independent scaling for compute and storage.
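The two scaling knobs mentioned here, Pods for server count and uplinks for bandwidth, can be illustrated with simple arithmetic. All figures below (racks per Pod, servers per rack, link speeds) and both function names are hypothetical values chosen for the example, not Facebook's actual numbers.

```python
def pod_capacity(pods: int, racks_per_pod: int, servers_per_rack: int) -> int:
    """Total servers when capacity grows by adding identical Pods."""
    return pods * racks_per_pod * servers_per_rack


def oversubscription(servers_per_rack: int, server_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Server-facing bandwidth divided by uplink bandwidth on one rack switch.

    A value of 1.0 means the uplinks can carry everything the servers can send.
    """
    return (servers_per_rack * server_gbps) / (uplinks * uplink_gbps)


# Doubling the Pod count doubles server capacity without touching existing Pods.
print(pod_capacity(10, 48, 40), pod_capacity(20, 48, 40))  # 19200 38400

# Doubling the uplinks halves the oversubscription ratio for the same servers.
print(oversubscription(40, 10, 4, 40), oversubscription(40, 10, 8, 40))  # 2.5 1.25
```

Because the two functions share no inputs beyond servers per rack, server count and bandwidth can be planned separately, which is the independence the paragraph describes.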

4. Disaggregated Storage

Traditional Hadoop relied on data locality, placing computation on the same node as the data to reduce network traffic. With abundant network bandwidth in a disaggregated setup, compute clusters can be separate from storage clusters.

Because each Hadoop node bundled CPU, memory, and disk in a fixed ratio, it was difficult to scale compute or storage independently. Disaggregated storage also improves resilience: the network now outperforms a local disk in both bandwidth and latency, and the system can tolerate individual disk failures without affecting overall reliability.
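A small capacity-planning sketch shows why fixed per-node ratios are costly. The demand figures, node sizes, and the names `Demand`, `coupled_nodes`, and `disaggregated_nodes` are all hypothetical, chosen only to illustrate a storage-heavy workload.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Demand:
    cores: int
    storage_tb: int


def coupled_nodes(demand: Demand, cores_per_node: int, tb_per_node: int) -> int:
    """Nodes needed when every node carries both CPU and disk in a fixed ratio."""
    return max(
        math.ceil(demand.cores / cores_per_node),
        math.ceil(demand.storage_tb / tb_per_node),
    )


def disaggregated_nodes(demand: Demand, cores_per_compute_node: int,
                        tb_per_storage_node: int) -> tuple[int, int]:
    """(compute nodes, storage nodes) when each tier is sized on its own."""
    return (
        math.ceil(demand.cores / cores_per_compute_node),
        math.ceil(demand.storage_tb / tb_per_storage_node),
    )


demand = Demand(cores=10_000, storage_tb=50_000)

coupled = coupled_nodes(demand, cores_per_node=32, tb_per_node=48)
idle_cores = coupled * 32 - demand.cores  # CPU bought only to get the disks

compute, storage = disaggregated_nodes(
    demand, cores_per_compute_node=32, tb_per_storage_node=200
)
print(coupled, idle_cores, compute, storage)  # 1042 23344 313 250
```

In the coupled case storage demand dictates the node count, leaving most of the purchased cores idle; sized separately, each tier buys only what its own demand requires.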

5. Spark with Warm Storage

Facebook built an internal Warm Storage system, a distributed storage layer optimized for large and small I/O sizes, reducing IOPS bottlenecks in Hive and Spark workloads.

In Spark, the shuffle phase writes intermediate data to local disks; a disk failure forces a costly retry. Large‑scale jobs amplify this risk, making traditional Spark‑HDFS co‑location problematic.
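Why large jobs amplify the risk can be shown with basic probability. The sketch below assumes each task independently hits a local-disk failure with the same small probability; that independence, the 1-in-10,000 rate, and the function name `job_failure_probability` are illustrative assumptions, not measured values.

```python
def job_failure_probability(tasks: int, p_disk_failure_per_task: float) -> float:
    """Probability that at least one of `tasks` independent tasks hits a disk failure."""
    return 1.0 - (1.0 - p_disk_failure_per_task) ** tasks


# Hypothetical per-task failure rate of 1 in 10,000.
for tasks in (100, 10_000, 1_000_000):
    print(tasks, round(job_failure_probability(tasks, 1e-4), 3))
```

Under these assumptions a 100-task job is affected about 1% of the time, a 10,000-task job about 63% of the time, and a million-task job almost always, which is why retries on local-disk failure become so expensive at scale.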

By decoupling compute nodes (high-memory, high-CPU, minimal local disk) from Warm Storage over a high-speed network, Spark can scale compute capacity independently, avoid HDFS-induced I/O contention, and achieve up to a four-fold improvement in reliability over local-disk setups.

The disaggregated approach fundamentally changes system scalability, affecting the entire architecture rather than isolated features.

Conclusion

In massive environments, disaggregation effectively solves scalability challenges, though smaller‑scale scenarios may still benefit from integrated designs.

