How Baidu’s CCS System Scales Command Execution Across Millions of Servers
This article examines Baidu’s Cluster Control System (CCS): its two‑level data model, four‑tier scheduling architecture, and three‑layer execution agents. It explains how separating control information from execution information, together with redundancy and fault‑tolerant design, enables reliable command execution across a very large server fleet.
Two‑Level Data Model
The CCS data model has two levels. Control information (routing, concurrency, pause points) describes how a command is distributed; execution information (authentication, authorization, command payload) describes what runs on each server and under whose authority. Keeping the two apart lets CCS distribute commands flexibly and reliably across many servers.
Design Considerations
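Both kinds of data are required. Control data lets the scheduler route a command precisely, cap concurrency, and stop at pause points; execution data lets each server verify the caller and run the intended command. Together they keep large operations safe and coordinated.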
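The split between control and execution information can be sketched as two record types. This is a minimal illustration, not the CCS schema: every class and field name below (ControlInfo, ExecutionInfo, Task, pause_after, and so on) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ControlInfo:
    """How a command propagates (all field names are illustrative)."""
    targets: List[str]        # routing: hosts the command is sent to
    concurrency: int = 1      # max hosts executing at the same time
    # Pause points, expressed here as counts of completed hosts.
    pause_after: List[int] = field(default_factory=list)


@dataclass
class ExecutionInfo:
    """What runs on each host and on whose behalf."""
    user: str                 # identity used for authentication
    role: str                 # role checked during authorization
    command: List[str]        # command payload as an argv list


@dataclass
class Task:
    control: ControlInfo
    execution: ExecutionInfo

    def batches(self) -> List[List[str]]:
        """Split the targets into batches no larger than the concurrency limit."""
        n = max(1, self.control.concurrency)
        hosts = self.control.targets
        return [hosts[i:i + n] for i in range(0, len(hosts), n)]
```

With three targets and a concurrency of 2, batches() yields one batch of two hosts followed by one batch of one, while the execution half of the task is carried along unchanged.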
Four‑Tier Scheduling Model
Unified Access Layer
Provides a consistent entry point for users, enforces traffic control through quotas and blocking, and balances load through virtual IPs (VIPs).
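A quota‑and‑block check at the entry point might look like the sketch below. The article does not say how CCS counts quota, so the per‑user in‑flight limit, the AccessGate class, and its method names are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Set


class AccessGate:
    """Admits or rejects requests at the entry point (hypothetical sketch)."""

    def __init__(self, quotas: Dict[str, int]) -> None:
        self.quotas = quotas                      # max in-flight tasks per user
        self.in_flight: Dict[str, int] = defaultdict(int)
        self.blocked: Set[str] = set()            # users rejected outright

    def admit(self, user: str) -> bool:
        if user in self.blocked:
            return False
        # Users without a configured quota get a limit of zero.
        if self.in_flight[user] >= self.quotas.get(user, 0):
            return False
        self.in_flight[user] += 1
        return True

    def release(self, user: str) -> None:
        """Call when a task finishes so its quota slot is freed."""
        if self.in_flight[user] > 0:
            self.in_flight[user] -= 1
```

Rejecting at the edge keeps an over‑eager caller from consuming capacity in the tiers behind it.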
Hierarchical Scheduling Layer
Acts as the core scheduling tier, linking upstream nodes with data‑center layers through a full‑mesh connection, isolating different business priorities to avoid interference.
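One way to picture priority isolation is a separate bounded queue per priority class, so that a backlog in one class cannot take capacity from another. The article does not describe the actual mechanism; the IsolatedScheduler class and its capacity map are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Optional


class IsolatedScheduler:
    """One bounded queue per priority class (illustrative only)."""

    def __init__(self, capacity: Dict[str, int]) -> None:
        self.capacity = capacity
        self.queues: Dict[str, Deque[str]] = {p: deque() for p in capacity}

    def submit(self, priority: str, task_id: str) -> bool:
        """Queue a task; refuse it if that class's own queue is full."""
        queue = self.queues[priority]
        if len(queue) >= self.capacity[priority]:
            return False
        queue.append(task_id)
        return True

    def next_task(self, priority: str) -> Optional[str]:
        """Pop the oldest task of the given class, or None if it is empty."""
        queue = self.queues[priority]
        return queue.popleft() if queue else None
```

In this sketch a full low‑priority queue only rejects further low‑priority work; high‑priority submissions are unaffected.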
Data Center Aggregation Layer
Aggregates local server status, dispatches commands, and collects results via heartbeat messages from execution agents, reducing cross‑data‑center traffic and preserving bandwidth.
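The heartbeat exchange can be sketched as follows: an agent reports finished work in its heartbeat, receives any queued commands in the reply, and only a compact summary travels upstream. The Aggregator class, its method names, and the status strings are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class Aggregator:
    """Per-data-center aggregation node (hypothetical sketch)."""

    def __init__(self) -> None:
        # host -> task ids waiting for that host's next heartbeat
        self.pending: Dict[str, List[str]] = defaultdict(list)
        # (task id, host) -> reported status
        self.results: Dict[Tuple[str, str], str] = {}

    def dispatch(self, task_id: str, hosts: List[str]) -> None:
        for host in hosts:
            self.pending[host].append(task_id)

    def heartbeat(self, host: str, finished: Dict[str, str]) -> List[str]:
        """Record results reported by an agent and return its queued tasks."""
        for task_id, status in finished.items():
            self.results[(task_id, host)] = status
        return self.pending.pop(host, [])

    def summary(self, task_id: str) -> Dict[str, int]:
        """Per-status counts: the only data that needs to leave the data center."""
        return dict(Counter(s for (t, _), s in self.results.items() if t == task_id))
```

Sending counts instead of per‑host results is what saves cross‑data‑center bandwidth in this sketch.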
Execution Proxy Layer
Hosts the CCS‑Agent on each server, handling authentication, authorization, backup, and command assembly before handing off to the generic execution layer.
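The four agent steps can be read as a pipeline that either produces a ready‑to‑run command or refuses. The sketch below assumes a token table for authentication and a per‑user allow‑list of executables for authorization; neither detail comes from the article, and prepare_command is a hypothetical name.

```python
import hmac
import shlex
from typing import Callable, Dict, List, Set


def prepare_command(
    request: Dict[str, str],
    tokens: Dict[str, str],
    allowed: Dict[str, Set[str]],
    backup: Callable[[Dict[str, str]], None],
) -> List[str]:
    """Authenticate, authorize, back up, then assemble the argv to execute."""
    user = request["user"]

    # 1. Authentication: the presented token must match the one on record.
    expected = tokens.get(user)
    if expected is None or not hmac.compare_digest(expected, request["token"]):
        raise PermissionError("authentication failed")

    # 2. Authorization: the executable must be on this user's allow-list.
    argv = shlex.split(request["command"])
    if not argv or argv[0] not in allowed.get(user, set()):
        raise PermissionError("not authorized")

    # 3. Backup: persist the request before anything is handed off.
    backup(request)

    # 4. Assembly: return the argv for the generic execution layer.
    return argv
```

A request that fails either check never reaches the backup or execution steps.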
Three‑Tier Proxy Execution
Design Considerations
Stability is paramount. The execution logic is therefore split out of the CCS‑Agent into a generic execution layer made up of three processes: an execution proxy process, an execution endpoint process, and the user process. Each runs in isolation so that a failure in one does not propagate to the others.
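Process isolation of this kind can be illustrated with a single boundary: the user command runs in its own child process, and a crash, non‑zero exit, or hang comes back as a status value instead of taking the caller down. This shows only one of the three boundaries described above, and run_isolated and its status strings are assumptions.

```python
import subprocess
import sys
from typing import List, Tuple


def run_isolated(argv: List[str], timeout_s: float) -> Tuple[str, str]:
    """Run a command in a child process and report (status, stdout)."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return ("timeout", "")
    except OSError as exc:
        # The command could not be started at all (missing binary, etc.).
        return ("spawn_error", str(exc))
    status = "ok" if done.returncode == 0 else "failed"
    return (status, done.stdout)


# Demo: run a tiny Python child with the current interpreter.
print(run_isolated([sys.executable, "-c", "print('hi')"], timeout_s=10))
```

Because every outcome is converted to a return value, the calling process keeps running regardless of what the user command does.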
Exception Handling
Insufficient Capacity
Capacity‑related failures have been the most common. The unified access layer mitigates them with quotas, and the hierarchical scheduling layer isolates traffic so that one workload cannot crowd out another.
Network Jitter
Network jitter can cause messages to be lost. CCS therefore combines push with pull, so that a message missed by one path can still arrive by the other, and it adjusts the pull frequency to balance latency against reliability.
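The article does not give the rule CCS uses to adjust the pull frequency. One plausible policy, shown here purely as an assumption, is to pull more often when a pull turns up messages that push failed to deliver and to back off when it does not. Since both paths can deliver the same message, the receiver also needs to drop duplicates. The names next_pull_interval and Inbox are hypothetical.

```python
from typing import Set


def next_pull_interval(current_s: float, missed_by_push: int,
                       min_s: float = 1.0, max_s: float = 60.0) -> float:
    """Halve the interval if the last pull found missed messages, else double it."""
    if missed_by_push > 0:
        return max(min_s, current_s / 2)
    return min(max_s, current_s * 2)


class Inbox:
    """Accepts each message id once, whether it arrived by push or by pull."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()

    def accept(self, msg_id: str) -> bool:
        if msg_id in self.seen:
            return False
        self.seen.add(msg_id)
        return True
```

Under this policy a healthy network drifts toward infrequent pulls, while a lossy one converges on the minimum interval.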
Single‑Machine Execution Exceptions
Failures on a single machine come from many sources: third‑party bugs, resource exhaustion, OS bugs, and force majeure. CCS addresses them by performing the backup before execution and by using a single‑threaded design, which simplifies the logic and improves stability.
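Backing up before executing can be sketched as a small append‑only journal: the task is written to disk first, then run, then marked done, so anything that started but never finished can be found after a failure. The journal format, file name, and function names below are assumptions; the code is deliberately single‑threaded.

```python
import json
import os
import tempfile
from typing import Callable, List


def _append(journal_path: str, record: dict) -> None:
    """Append one JSON line and force it to disk before returning."""
    with open(journal_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())


def run_with_backup(journal_path: str, task_id: str, action: Callable[[], str]) -> str:
    _append(journal_path, {"task": task_id, "state": "started"})  # back up first
    result = action()                                              # then execute
    _append(journal_path, {"task": task_id, "state": "done", "result": result})
    return result


def unfinished(journal_path: str) -> List[str]:
    """Tasks recorded as started but never marked done."""
    open_tasks: List[str] = []
    with open(journal_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec["state"] == "started":
                open_tasks.append(rec["task"])
            elif rec["task"] in open_tasks:
                open_tasks.remove(rec["task"])
    return open_tasks


def _crash() -> str:
    raise RuntimeError("simulated failure")


# Demo: the second task fails mid-run, so it stays in the journal as unfinished.
path = os.path.join(tempfile.mkdtemp(), "ccs-journal.log")
run_with_backup(path, "t1", lambda: "ok")
try:
    run_with_backup(path, "t2", _crash)
except RuntimeError:
    pass
print(unfinished(path))  # prints ['t2']
```

With one thread there is no interleaving of journal writes to reason about, which is the kind of simplification the single‑threaded design buys.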
Summary
With CCS, Baidu solved the problem of executing commands at massive scale across its internal servers. The system is stable and widely used, but areas such as command latency and hot‑standby strategy are still being improved.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.