Inside Borg: The Predecessor of Kubernetes and Its Architecture Explained
This article provides a comprehensive analysis of Google’s Borg system, covering its design goals, user view, job and task model, resource allocation, scheduling algorithms, fault tolerance, scalability techniques, and operational metrics that shaped modern cloud‑native orchestration platforms.
1. Introduction
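Borg is Google's large-scale cluster manager. It runs hundreds of thousands of jobs, from many thousands of applications, across clusters that each contain up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task packing, over-commitment, and machine sharing with process-level performance isolation. Borg hides the details of resource management and failure handling so that developers can focus on application logic, and it offers a declarative job specification language.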
2. User View
Users submit work to Borg as jobs, each consisting of one or more tasks that run the same binary. Every job runs in exactly one cell, a set of machines managed as a unit. Production ("prod") jobs are mostly latency-sensitive services such as Gmail, Docs, and Search; the remaining non-production jobs are mostly batch work. In a representative cell, prod jobs are allocated about 70 % of total CPU and about 55 % of total memory, and non-prod jobs receive the rest.
2.1 Workloads
Borg supports two workload types: long-running services that should never go down and that answer short-lived, latency-sensitive requests, and batch jobs that take from a few seconds to a few days and are much less sensitive to short-term performance fluctuations.
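2.2 Clusters and Cells
A cell is a set of machines that belong to a single cluster, which is defined by the high-performance data-center network fabric connecting them. A cluster usually hosts one large cell and may also host a few smaller test or special-purpose cells. The median cell has roughly 10 k machines, and some are much larger. Machines are heterogeneous in CPU, RAM, disk, network, and other attributes. Borg isolates users from most of this heterogeneity by deciding where tasks run, allocating their resources, installing their programs and dependencies, and monitoring their health.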
2.3 Jobs and Tasks
A job has properties such as its name, owner, and number of tasks. Tasks inherit the job's properties but can override individual ones, such as resource requirements, command-line flags, and placement constraints. Each task maps to a set of Linux processes running in a container on one machine. The vast majority of the workload does not run inside virtual machines, which avoids the cost of virtualization.
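The inheritance rule above can be pictured as a job holding one default task specification plus a sparse set of per-task overrides. The following is a minimal sketch only; `TaskSpec`, `Job`, and every field name are hypothetical and do not reflect Borg's actual configuration language.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class TaskSpec:
    """Per-task settings. All names here are illustrative, not Borg's."""
    cpu_cores: float
    ram_gib: float
    flags: tuple = ()


@dataclass
class Job:
    name: str
    owner: str
    task_count: int
    defaults: TaskSpec
    # Sparse map: task index -> only the fields that differ from the defaults.
    overrides: dict = field(default_factory=dict)

    def task(self, index: int) -> TaskSpec:
        """Return the effective spec for one task: defaults plus overrides."""
        if not 0 <= index < self.task_count:
            raise IndexError(f"job {self.name} has no task {index}")
        return replace(self.defaults, **self.overrides.get(index, {}))


job = Job(
    name="jfoo",
    owner="ubar",
    task_count=3,
    defaults=TaskSpec(cpu_cores=1.0, ram_gib=2.0, flags=("--port=8080",)),
    overrides={2: {"ram_gib": 4.0}},  # task 2 asks for more memory
)
```

Here `job.task(0)` returns the defaults unchanged, while `job.task(2)` differs only in `ram_gib`.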
2.4 Allocation
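An alloc (short for allocation) is a reserved set of resources on a single machine in which one or more tasks can run; the resources stay assigned whether or not they are in use. Allocs can set resources aside for future tasks, retain resources between stopping a task and starting it again, and gather tasks from different jobs onto the same machine, for example a web server instance and the task that ships its logs. An alloc set is the analogue of a job: a group of allocs that reserve resources on multiple machines, into which one or more jobs can then be submitted.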
2.5 Priority, Quota, and Admission Control
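Each job has a priority, a small positive integer. A high-priority task can obtain resources at the expense of a lower-priority one, even if that means preempting (killing) it. Borg defines non-overlapping priority bands for different uses, in decreasing order of priority: monitoring, production, batch, and best effort. Quota decides which jobs are admitted for scheduling. It is expressed as a vector of resource quantities (CPU, RAM, disk, and so on) at a given priority for a period of time, typically months, and a job submitted with insufficient quota is rejected immediately.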
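A small sketch may help separate the two mechanisms: priority governs who may preempt whom, while quota governs whether a job is admitted at all. The numeric band ranges, the `Resources` type, and the function names below are assumptions made for illustration; only the ordering of the bands comes from the text. The sketch also includes one rule from the Borg paper that is not mentioned above: tasks in the production band do not preempt one another, which avoids preemption cascades.

```python
from dataclasses import dataclass

# Hypothetical numeric ranges. Only the relative order of the bands is real.
BANDS = {
    "best_effort": range(0, 100),
    "batch": range(100, 200),
    "production": range(200, 300),
    "monitoring": range(300, 400),
}


def band_of(priority: int) -> str:
    """Map a numeric priority to the name of its band."""
    for name, priorities in BANDS.items():
        if priority in priorities:
            return name
    raise ValueError(f"priority {priority} is outside every band")


def can_preempt(candidate: int, victim: int) -> bool:
    """Higher priority may preempt lower, except within the production band."""
    if candidate <= victim:
        return False
    both_prod = band_of(candidate) == band_of(victim) == "production"
    return not both_prod


@dataclass(frozen=True)
class Resources:
    cpu_cores: float
    ram_gib: float


def admit(request: Resources, in_use: Resources, quota: Resources) -> bool:
    """Admission check: reject if the request would exceed the quota vector."""
    return (
        in_use.cpu_cores + request.cpu_cores <= quota.cpu_cores
        and in_use.ram_gib + request.ram_gib <= quota.ram_gib
    )
```

Note that the quota check is per dimension: a request that fits in CPU but not in RAM is still rejected.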
2.6 Naming and Monitoring
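Borg creates a stable Borg name service (BNS) name for each task that includes the cell name, job name, and task number. It writes the task's hostname and port into a consistent, highly available file in Chubby under this name, so clients can find the task's endpoint even after it has been rescheduled onto another machine. Almost every task contains a built-in HTTP server that publishes health information and thousands of performance metrics; Borg monitors the health check and restarts tasks that stop responding.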
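The value of a stable name is easiest to see in a toy lookup. In the sketch below, the name layout follows the example given in the Borg paper (task 50 of job jfoo, owned by user ubar, in cell cc); `bns_name` and `NameRegistry` are hypothetical helpers, and the registry is an in-memory dictionary standing in for the Chubby-backed records, not a real client.

```python
def bns_name(cell: str, user: str, job: str, task_index: int) -> str:
    """Compose a task name in the layout shown in the Borg paper's example."""
    return f"{task_index}.{job}.{user}.{cell}.borg.google.com"


class NameRegistry:
    """In-memory stand-in for the name -> (host, port) records."""

    def __init__(self) -> None:
        self._records = {}

    def publish(self, name: str, host: str, port: int) -> None:
        # Called when a task starts, and again whenever it is moved.
        self._records[name] = (host, port)

    def resolve(self, name: str) -> tuple:
        # Clients look up the stable name instead of caching host and port.
        return self._records[name]


registry = NameRegistry()
name = bns_name("cc", "ubar", "jfoo", 50)
registry.publish(name, "machine-a", 8080)
registry.publish(name, "machine-b", 9090)  # the task was rescheduled
```

After the second `publish`, `registry.resolve(name)` returns the new endpoint while the name itself has not changed.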
3. Borg Architecture
A cell consists of a set of machines, a logically centralized controller called the Borgmaster, and an agent process called the Borglet that runs on every machine.
3.1 Borgmaster
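Each cell's Borgmaster consists of two processes: the main Borgmaster process, which handles client RPCs that either mutate state (such as creating a job) or read it, and a separate scheduler. The Borgmaster is logically a single process but is replicated five times. Each replica keeps an in-memory copy of most of the cell state, and that state is also recorded in a highly available Paxos-based store on the replicas' local disks. A single elected master per cell acts as both Paxos leader and state mutator, handling every operation that changes cell state, such as submitting a job or terminating a task. The Borgmaster's state at a point in time is called a checkpoint, stored as a periodic snapshot plus a change log. Checkpoints are used to restore state, to debug, and to drive offline simulation.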
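According to the Borg paper, a checkpoint takes the form of a periodic snapshot plus a change log. Recovery under that scheme amounts to loading the snapshot and replaying the logged operations in order. The sketch below shows the idea with plain dictionaries; the operation names and the state layout are invented for illustration and say nothing about Borg's real storage format.

```python
import copy


def apply_change(state: dict, change: tuple) -> None:
    """Apply one logged operation to the in-memory state (mutates `state`)."""
    kind, job_name, payload = change
    if kind == "submit":
        state[job_name] = payload
    elif kind == "kill":
        state.pop(job_name, None)
    else:
        raise ValueError(f"unknown change kind: {kind}")


def restore(snapshot: dict, change_log: list) -> dict:
    """Rebuild state as of the end of the log without touching the snapshot."""
    state = copy.deepcopy(snapshot)
    for change in change_log:
        apply_change(state, change)
    return state


snapshot = {"jfoo": {"tasks": 3}}
change_log = [
    ("submit", "jbar", {"tasks": 1}),
    ("kill", "jfoo", None),
]
recovered = restore(snapshot, change_log)
```

Because `restore` never mutates its inputs, the same snapshot and log can be replayed repeatedly, which is the property that makes offline simulation against recorded state practical.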
3.2 Scheduling
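When a job is submitted, the Borgmaster records it persistently in the Paxos store and adds its tasks to the pending queue. The scheduler scans this queue asynchronously, from high to low priority, and assigns tasks to machines. The algorithm has two parts. Feasibility checking finds the machines on which the task could run: those that meet its constraints and have enough available resources, counting resources held by lower-priority tasks that can be evicted. Scoring then picks one of the feasible machines using criteria such as the number and priority of tasks that would be preempted, whether the machine already holds a copy of the task's packages, spreading across power and failure domains, and packing quality. Borg originally scored with a variant of E-PVM, which spreads load across machines and leaves headroom for spikes at the cost of more fragmentation. The current hybrid model tries to reduce stranded resources and packs about 3-5 % better than best fit on Google's workloads.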
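The two-phase structure (filter for feasibility, then score) can be sketched in a few lines. This is a deliberately simplified toy, not Borg's scoring model: it ignores preemption and constraints, uses a crude best-fit score that adds leftover CPU and RAM despite their different units, and adds a fixed penalty for reusing a rack that already hosts the job. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    free_cpu: float
    free_ram: float
    rack: str


@dataclass(frozen=True)
class Task:
    job: str
    cpu: float
    ram: float


def feasible(machine: Machine, task: Task) -> bool:
    """Phase 1: does the machine have enough free resources?"""
    return machine.free_cpu >= task.cpu and machine.free_ram >= task.ram


def score(machine: Machine, task: Task, racks_in_use: set) -> float:
    """Phase 2: lower is better. Best fit plus a failure-domain penalty."""
    leftover = (machine.free_cpu - task.cpu) + (machine.free_ram - task.ram)
    same_rack_penalty = 100.0 if machine.rack in racks_in_use else 0.0
    return leftover + same_rack_penalty


def place(task: Task, machines: list, racks_in_use: set):
    """Pick the best feasible machine, reserve resources, or return None."""
    candidates = [m for m in machines if feasible(m, task)]
    if not candidates:
        return None  # the task stays pending
    best = min(candidates, key=lambda m: score(m, task, racks_in_use))
    best.free_cpu -= task.cpu
    best.free_ram -= task.ram
    racks_in_use.add(best.rack)
    return best
```

A pure best-fit score packs tightly but can strand resources, and a pure spreading score does the opposite, which is the tension the hybrid model described above tries to balance.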
3.3 Borglet
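The Borglet is a local agent on every machine in a cell. It starts and stops tasks, restarts them if they fail, manages local resources, and reports the machine's state to the Borgmaster and other monitoring systems. The Borgmaster polls each Borglet every few seconds to retrieve the machine's current state and to send it any outstanding requests. Because the master sets the pace, it controls the rate of communication, needs no explicit flow-control mechanism, and avoids recovery storms.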
3.4 Scalability
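A single Borgmaster can manage many thousands of machines in a cell, and several cells see arrival rates above 10 000 tasks per minute. Three techniques keep the scheduler fast: score caching (reusing a machine's score until the machine or the task changes), equivalence classes (running feasibility and scoring once per group of tasks with identical requirements and constraints), and relaxed randomization (examining machines in random order until enough feasible ones have been found, rather than scoring every machine in the cell). With these techniques, scheduling a cell's entire workload from scratch typically takes a few hundred seconds; with them disabled, the same run did not finish after more than three days. A normal online pass over the pending queue completes in under half a second.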
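The three techniques compose naturally, and a compact sketch shows how each one cuts work. Everything below is illustrative: tasks and machines are plain dictionaries, the choice of fields in `equivalence_key` is an assumption, and `ScoreCache` invalidates by machine name only, which is coarser than a real scheduler would need.

```python
import random
from collections import defaultdict


def equivalence_key(task: dict) -> tuple:
    """Tasks with the same key are interchangeable for scheduling purposes."""
    return (task["cpu"], task["ram"], task["priority"],
            tuple(sorted(task["constraints"])))


def group_by_class(tasks: list) -> dict:
    """Equivalence classes: evaluate one representative per group."""
    classes = defaultdict(list)
    for task in tasks:
        classes[equivalence_key(task)].append(task)
    return classes


class ScoreCache:
    """Score caching: recompute only after a machine has changed."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._cache = {}
        self.misses = 0

    def get(self, key: tuple, machine: dict) -> float:
        cache_key = (key, machine["name"])
        if cache_key not in self._cache:
            self.misses += 1
            self._cache[cache_key] = self._score_fn(key, machine)
        return self._cache[cache_key]

    def invalidate(self, machine_name: str) -> None:
        stale = [k for k in self._cache if k[1] == machine_name]
        for k in stale:
            del self._cache[k]


def sample_feasible(machines, is_feasible, enough: int, rng: random.Random):
    """Relaxed randomization: stop once `enough` feasible machines are found."""
    order = list(machines)
    rng.shuffle(order)
    found = []
    for machine in order:
        if is_feasible(machine):
            found.append(machine)
            if len(found) >= enough:
                break
    return found
```

The sampled result is not guaranteed to contain the globally best machine. That is the trade-off relaxed randomization accepts in exchange for not touching every machine on every scheduling pass.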
4. Availability
Failures are the norm in large systems. Borg limits their impact by automatically rescheduling evicted tasks, spreading the tasks of a job across failure domains such as machines, racks, and power domains, and limiting both the rate of task disruptions and the number of a job's tasks that may be down at once during maintenance. Tasks that are already running keep running even if the Borgmaster or their Borglet goes down. The Borgmaster itself achieves 99.99 % availability in practice.
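Several design choices support this level of availability: replication to survive machine failures, admission control to avoid overload, and deployment with simple, low-level tools to minimize external dependencies. Each cell is independent of the others, which reduces the chance of correlated operator errors and of failures propagating between cells.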
