How Facebook Scales Millions of Servers with Twine: Inside Its Cluster Management Engine

This article explains how Facebook’s Twine system orchestrates containers across millions of servers, detailing its architecture, support for stateful services, cross‑data‑center control, elastic capacity handling, and the lessons learned from eight years of large‑scale operations.

Cloud Native Technology Community

Mar 3, 2021

Introduction

At the Systems@Scale conference, Facebook presented Twine, its internal cluster management system, which now manages the majority of Facebook’s servers. Since its first deployment in 2011, Twine has grown from a single data center to fifteen geographically distributed data centers and now manages millions of servers.

Twine Architecture

Twine consists of several components:

Frontend: Provides the UI, CLI, and API for users and hides Twine's internal details from them.

Scheduler: Controls job and container lifecycles. It is sharded, with each shard managing a set of jobs. Regional schedulers manage servers within one region, while a global scheduler manages servers across regions.

Scheduler Proxy: Hides sharding details and presents a single control plane to users.

Allocator: Assigns containers to servers; the scheduler then orchestrates the start, stop, update, and failover actions. A single allocator currently scales to an entire region without sharding.

Resource Broker: Stores server inventory and maintenance events, and works with a capacity‑management system to decide which scheduler controls which servers.

Agent Daemon: Runs on each server to set up and tear down containers, using image‑based containers with btrfs, cgroupv2, and systemd.

Key Features

Seamless Support for Stateful Services – Twine runs critical stateful workloads such as ZippyDB, ODS Gorilla, and Scuba for Facebook, Instagram, Messenger, and WhatsApp. It provides a TaskControl interface that lets stateful services influence container lifecycle decisions (e.g., delaying upgrades on servers that host database replicas).
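The TaskControl idea can be illustrated with a small sketch. Twine's real interface is not described in this article, so everything below (the `LifecycleRequest` shape, the `review` and `completed` methods, the `min_healthy` threshold) is a hypothetical stand-in. It only shows the general pattern: the scheduler asks the service before touching a container, and the service may delay the operation to protect its replicas.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    DELAY = "delay"


@dataclass(frozen=True)
class LifecycleRequest:
    """A container operation the scheduler wants to run (hypothetical shape)."""
    operation: str   # e.g. "update", "restart", "move"
    shard: str       # data shard whose replica lives in the container
    container: str   # container identifier


@dataclass
class ReplicaAwareTaskControl:
    """Hypothetical controller for a replicated database.

    Approves an operation only if the shard would still have at least
    `min_healthy` replicas that are not being operated on.
    """
    replicas_per_shard: dict[str, int]
    min_healthy: int = 2
    in_flight: dict[str, set[str]] = field(default_factory=dict)

    def review(self, req: LifecycleRequest) -> Decision:
        busy = self.in_flight.setdefault(req.shard, set())
        if req.container in busy:
            return Decision.APPROVE  # already approved; keep the call idempotent
        healthy_after = self.replicas_per_shard[req.shard] - len(busy) - 1
        if healthy_after < self.min_healthy:
            return Decision.DELAY    # too few untouched replicas would remain
        busy.add(req.container)
        return Decision.APPROVE

    def completed(self, req: LifecycleRequest) -> None:
        """Called once the operation finishes, freeing the slot."""
        self.in_flight.get(req.shard, set()).discard(req.container)
```

With three replicas and `min_healthy=2`, the first upgrade request for a shard is approved, a second concurrent one is delayed, and it is approved once the first reports completion.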

Cross‑Data‑Center Single Control Plane – Early Twine versions managed each cluster with a dedicated scheduler, which limited every job to a single cluster. Introducing the Resource Broker enabled dynamic binding of servers to schedulers, so jobs can now span clusters and data centers, and cluster retirement and maintenance become simpler.
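To make dynamic binding concrete, here is a toy model of an inventory that records which scheduler controls each server. The class and method names (`ResourceBroker`, `register`, `bind`, `servers_of`, `datacenters_of`) are illustrative assumptions, not Twine's API. The point is only that ownership is a mutable mapping rather than something fixed by the cluster a server sits in.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceBroker:
    """Toy stand-in for an inventory that maps servers to schedulers."""
    datacenter_of: dict[str, str] = field(default_factory=dict)  # server -> data center
    scheduler_of: dict[str, str] = field(default_factory=dict)   # server -> scheduler

    def register(self, server: str, datacenter: str) -> None:
        """Record a server and the data center it physically lives in."""
        self.datacenter_of[server] = datacenter

    def bind(self, server: str, scheduler: str) -> None:
        """Assign (or reassign) a known server to a scheduler."""
        if server not in self.datacenter_of:
            raise KeyError(f"unknown server: {server}")
        self.scheduler_of[server] = scheduler

    def servers_of(self, scheduler: str) -> list[str]:
        return sorted(s for s, owner in self.scheduler_of.items() if owner == scheduler)

    def datacenters_of(self, scheduler: str) -> set[str]:
        return {self.datacenter_of[s] for s in self.servers_of(scheduler)}


broker = ResourceBroker()
broker.register("srv-1", "dc-a")
broker.register("srv-2", "dc-b")
broker.bind("srv-1", "sched-x")
broker.bind("srv-2", "sched-x")  # one scheduler now spans two data centers
broker.bind("srv-2", "sched-y")  # rebinding changes ownership without moving hardware
```

Because binding is just a table update in this model, draining a data center amounts to rebinding or unbinding its servers, with no change to how jobs are defined.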

Transparent Sharding for Scalability – To support a global shared pool of hundreds of thousands of servers, Twine shards its schedulers. Each shard handles a subset of jobs, reducing risk and allowing additional shards to be added as the pool grows, while presenting users with a unified control surface.
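A minimal sketch of how a routing layer could hide shards from users follows. The article does not say how Twine assigns jobs to shards, so the policy here (new jobs go to the least‑loaded shard, existing jobs never move) is an assumption chosen for simplicity, and `SchedulerProxy`, `submit`, `locate`, and `add_shard` are hypothetical names.

```python
class SchedulerProxy:
    """Routes each job to one scheduler shard behind a single entry point."""

    def __init__(self, shard_names: list[str]) -> None:
        if not shard_names:
            raise ValueError("at least one shard is required")
        self._jobs_by_shard: dict[str, set[str]] = {name: set() for name in shard_names}
        self._shard_of_job: dict[str, str] = {}

    def add_shard(self, name: str) -> None:
        """Grow capacity; jobs already placed keep their shard."""
        self._jobs_by_shard.setdefault(name, set())

    def submit(self, job: str) -> str:
        """Place a new job on the least-loaded shard (ties broken by name)."""
        if job in self._shard_of_job:
            return self._shard_of_job[job]
        shard = min(self._jobs_by_shard, key=lambda s: (len(self._jobs_by_shard[s]), s))
        self._jobs_by_shard[shard].add(job)
        self._shard_of_job[job] = shard
        return shard

    def locate(self, job: str) -> str:
        """Users call this (indirectly); they never name a shard themselves."""
        return self._shard_of_job[job]
```

An explicit job‑to‑shard table is used instead of hashing the job name modulo the shard count, because a modulo scheme would reshuffle existing jobs whenever a shard is added.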

Elastic Capacity and Utilization

Twine uses elastic computing to improve server utilization. During off‑peak periods, online services are scaled down and the freed servers are offered to offline workloads such as machine‑learning and MapReduce jobs. The Resource Broker tracks server health and can quickly reclaim servers for online services when demand spikes.
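The lend‑and‑reclaim cycle comes down to simple arithmetic, sketched below. The function names, the 20% headroom, and all traffic figures are invented for illustration and do not come from Facebook. Integer math is used so the results are exact.

```python
def servers_needed(load_rps: int, rps_per_server: int, headroom_pct: int = 20) -> int:
    """Servers the online service needs for the current load plus headroom."""
    numerator = load_rps * (100 + headroom_pct)
    denominator = 100 * rps_per_server
    return -(-numerator // denominator)  # ceiling division on integers


def lending_delta(total: int, lent: int, load_rps: int, rps_per_server: int,
                  headroom_pct: int = 20) -> int:
    """Positive: servers that can additionally be lent to offline jobs.
    Negative: servers that must be reclaimed for the online service."""
    needed = servers_needed(load_rps, rps_per_server, headroom_pct)
    target_lent = max(0, total - needed)
    return target_lent - lent


# Off-peak: 200 servers, none lent, 40,000 requests/s at 500 requests/s per server.
off_peak = lending_delta(total=200, lent=0, load_rps=40_000, rps_per_server=500)

# Demand spike to 80,000 requests/s while 104 servers are on loan.
spike = lending_delta(total=200, lent=104, load_rps=80_000, rps_per_server=500)
```

In this example the off‑peak call yields +104 (lend 104 servers) and the spike yields -96 (reclaim 96 of them), since 96 and 192 servers are needed respectively.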

Lessons Learned and Future Work

Maintain a flexible mapping between the control plane and managed servers to enable cross‑data‑center automation and elastic capacity transfers.

A single abstracted control plane per region improves usability while still allowing internal sharding for fault tolerance.

Plugin models let external applications react to container lifecycle events, simplifying support for diverse stateful services.

Elastic capacity processes that release whole servers for batch workloads are effective for improving utilization on energy‑efficient hardware.

Facebook is still expanding its shared pool. About 20% of servers are currently in it, and the goal is to move the entire fleet into the shared pool while adding storage pool support, automated maintenance, multi‑tenant controls, and better support for machine‑learning workloads.
