
Inside Borg: The Predecessor of Kubernetes and Its Architecture Explained

This article provides a comprehensive analysis of Google’s Borg system, covering its design goals, user view, job and task model, resource allocation, scheduling algorithms, fault tolerance, scalability techniques, and operational metrics that shaped modern cloud‑native orchestration platforms.

Java Architect Essentials

1. Introduction

Borg is Google's large-scale cluster manager. It runs hundreds of thousands of jobs from many thousands of applications across a number of clusters, each with up to tens of thousands of machines. It achieves high resource utilization, tolerates failures, and offers a declarative job specification language. By hiding the details of resource management and failure handling, it lets developers focus on application logic.

2. User View

Users submit jobs, each consisting of one or more tasks that all run the same binary. Every job runs within a single cell, a set of machines managed as a unit. Production ("prod") jobs are mostly latency-sensitive services such as Gmail, Docs, and Search, while non-production jobs are mostly batch work. In a typical cell, prod jobs are allocated about 70% of CPU capacity and about 55% of memory; non-production jobs receive the remainder.

2.1 Workloads

Borg supports two workload types: long‑running services that are latency‑sensitive, and batch jobs that can run from seconds to days and are less sensitive to performance fluctuations.

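2.2 Clusters and Cells

The machines in a cell belong to a single cluster, defined by the high-performance data-center network that connects them. A cluster usually hosts one large cell and may host a few smaller test or special-purpose cells. The median cell has roughly 10 k machines, and some are much larger. Machines are heterogeneous in CPU, RAM, disk, network, and other attributes. Borg shields users from most of this heterogeneity by deciding where tasks run, allocating their resources, installing their software, monitoring their health, and restarting them if they fail.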

2.3 Jobs and Tasks

A job has attributes such as its name, its owner, and its number of tasks. Tasks inherit the job's attributes but can override individual settings such as resource requirements, command-line flags, and placement constraints. Each task maps to a set of Linux processes running in a container on one machine. Most workloads run directly on the host rather than inside a VM, which avoids virtualization overhead.
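The inherit-then-override relationship between a job and its tasks can be sketched in a few lines of Python. This is an illustration only, not Borg's configuration language: the class names (JobSpec, TaskSpec), the field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass(frozen=True)
class TaskSpec:
    """Effective settings for one task (hypothetical field names)."""
    index: int
    cpu_cores: float
    ram_gib: float
    flags: tuple = ()


@dataclass
class JobSpec:
    """Job-level attributes that every task inherits by default."""
    name: str
    owner: str
    task_count: int
    cpu_cores: float
    ram_gib: float
    flags: tuple = ()
    # Per-task overrides keyed by task index; only the listed fields change.
    overrides: Dict[int, dict] = field(default_factory=dict)

    def tasks(self) -> List[TaskSpec]:
        result = []
        for i in range(self.task_count):
            base = TaskSpec(i, self.cpu_cores, self.ram_gib, self.flags)
            # Apply the override for this index, if any, on top of the defaults.
            result.append(replace(base, **self.overrides.get(i, {})))
        return result


# Placeholder job: three tasks, with task 2 asking for more RAM and an extra flag.
job = JobSpec(
    "example-job", "example-owner", task_count=3, cpu_cores=1.0, ram_gib=2.0,
    flags=("--port=8080",),
    overrides={2: {"ram_gib": 4.0, "flags": ("--port=8080", "--debug")}},
)
tasks = job.tasks()
```

Tasks 0 and 1 carry the job defaults unchanged, while task 2 differs only in the fields it overrides.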

2.4 Allocation

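An alloc (short for allocation) is a reserved set of resources on a single machine in which one or more tasks can run. The resources stay assigned whether or not they are in use. Allocs can set resources aside for future tasks, retain resources between stopping a task and restarting it, and co-locate tasks from different jobs on the same machine, for example a service instance and its log collector. An alloc set plays the role for allocs that a job plays for tasks: it is a group of allocs that reserve resources on multiple machines.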

2.5 Priority, Quota, and Admission Control

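Each job has a priority, a small positive integer, and a higher-priority task can preempt lower-priority ones to obtain resources. Borg defines non-overlapping priority bands for different uses, in decreasing order of priority: monitoring, production, batch, and best-effort. Quota determines which jobs are admitted for scheduling. It is expressed as a vector of resource quantities (CPU, RAM, disk, and so on) at a given priority for a period of time. Quota checking is part of admission control rather than scheduling: a job with insufficient quota is rejected at submission, which protects the cell from overload.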
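The following Python sketch shows how priority bands, preemption, and a quota check at submission might fit together. The band order follows the text, but the numeric ranges, the Quota fields, and the function names are invented for illustration, and the sketch assumes quota is checked once when the job is submitted.

```python
from dataclasses import dataclass

# Bands in decreasing priority order. The numeric ranges are invented for
# illustration; the source does not state the actual values.
BANDS = {
    "monitoring": range(400, 500),
    "production": range(300, 400),
    "batch": range(200, 300),
    "best-effort": range(100, 200),
}


def band_of(priority: int) -> str:
    """Return the name of the band that contains this priority."""
    for name, values in BANDS.items():
        if priority in values:
            return name
    raise ValueError(f"priority {priority} is outside every band")


def can_preempt(candidate: int, victim: int) -> bool:
    """A task may evict only a strictly lower-priority task."""
    return candidate > victim


@dataclass
class Quota:
    """A resource vector (hypothetical fields; real quotas cover more)."""
    cpu_cores: float
    ram_gib: float


def admit(requested: Quota, remaining: Quota) -> bool:
    """Admission control: accept the job only if it fits the remaining quota."""
    return (requested.cpu_cores <= remaining.cpu_cores
            and requested.ram_gib <= remaining.ram_gib)
```

With these ranges, a production task at priority 350 can preempt a batch task at priority 250, but not the other way around.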

2.6 Naming and Monitoring

Borg creates a stable Borg Name Service (BNS) name for each task, built from the cell name, the job name, and the task number. Borg writes the task's hostname and port into a highly available file in Chubby under this name, so clients can still find the task after it is rescheduled onto another machine. Almost every task also contains a built-in HTTP server that publishes health information and thousands of performance metrics.
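The idea of a stable name that keeps resolving after a task moves can be sketched as below. The dotted name layout is an assumed format chosen for illustration, and NameRegistry is an in-memory stand-in for the Chubby-backed mapping, not a Chubby client.

```python
from typing import Dict, Tuple


def bns_name(cell: str, job: str, task_index: int) -> str:
    """Build a per-task name from task, job, and cell (assumed layout)."""
    return f"{task_index}.{job}.{cell}"


class NameRegistry:
    """Hypothetical in-memory mapping from stable name to (host, port)."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, Tuple[str, int]] = {}

    def publish(self, name: str, host: str, port: int) -> None:
        # Overwrites any earlier endpoint, as happens when a task is moved.
        self._endpoints[name] = (host, port)

    def resolve(self, name: str) -> Tuple[str, int]:
        return self._endpoints[name]


registry = NameRegistry()
name = bns_name("cell-a", "example-job", 7)
registry.publish(name, "machine-12", 31000)
# The task is rescheduled: the name stays the same, the endpoint changes.
registry.publish(name, "machine-48", 31555)
```

Clients hold on to the name rather than the endpoint, so a relocation only requires republishing.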

3. Borg Architecture

A cell consists of a set of machines, a logically centralized controller called the Borgmaster, and an agent process called the Borglet that runs on every machine.

3.1 Borgmaster

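Each cell's Borgmaster consists of two processes: the main Borgmaster process, which handles client RPCs, and a separate scheduler. The Borgmaster is logically a single process but is replicated five times. Each replica keeps an in-memory copy of most of the cell's state, and that state is also recorded in a highly available, Paxos-based store on the replicas' local disks. A single elected master per cell acts as both Paxos leader and state mutator, handling every operation that changes cell state, such as job submission and task termination. The Borgmaster's state at a point in time is called a checkpoint; checkpoints are used to restore state and to drive offline simulation.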
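One common way to recover state from a checkpoint is to load a snapshot and replay the changes recorded after it. The sketch below assumes that shape; the dictionary state, the operation names ("set", "delete"), and the restore function are hypothetical and say nothing about Borg's actual storage format.

```python
import copy
from typing import Any, Dict, List, Tuple

Change = Tuple[str, str, Any]  # (operation, key, value)


def restore(snapshot: Dict[str, Any], change_log: List[Change]) -> Dict[str, Any]:
    """Rebuild state by replaying logged changes on top of a snapshot."""
    state = copy.deepcopy(snapshot)  # never mutate the stored snapshot
    for op, key, value in change_log:
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


snapshot = {"job/a": "running", "job/b": "pending"}
log = [("set", "job/b", "running"), ("set", "job/c", "pending"),
       ("delete", "job/a", None)]
recovered = restore(snapshot, log)
```

Because replay is deterministic, the same snapshot and log can also feed an offline simulator that needs to start from a real cell's state.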

3.2 Scheduling

When a job is submitted, the Borgmaster records it persistently in the Paxos store and adds its tasks to the pending queue. The scheduler scans this queue asynchronously, from high to low priority, and assigns tasks to machines in two steps. Feasibility checking finds the machines that satisfy the task's constraints and have enough available resources, counting resources held by lower-priority tasks that could be evicted. Scoring then picks one of the feasible machines, weighing factors such as the number and priority of tasks that would be preempted, packing quality, and the spread of tasks across failure domains. Borg originally used a variant of E-PVM for scoring and later moved to a hybrid model that tries to reduce stranded resources; it packs about 3-5% more efficiently than best fit.
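The two-step structure (filter by feasibility, then score the survivors) can be sketched as follows. This is a toy model, not Borg's scoring function: the Machine and RunningTask classes are hypothetical, only CPU and RAM are considered, and the score simply prefers machines that need no preemption and then the tightest fit.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RunningTask:
    priority: int
    cpu: float
    ram: float


@dataclass
class Machine:
    name: str
    cpu: float
    ram: float
    tasks: List[RunningTask] = field(default_factory=list)

    def available(self, priority: int) -> Tuple[float, float]:
        """Capacity not held by tasks at this priority or higher, so
        resources of lower-priority (evictable) tasks count as available."""
        keep = [t for t in self.tasks if t.priority >= priority]
        return (self.cpu - sum(t.cpu for t in keep),
                self.ram - sum(t.ram for t in keep))

    def free(self) -> Tuple[float, float]:
        """Capacity that is unused right now, with no eviction."""
        return (self.cpu - sum(t.cpu for t in self.tasks),
                self.ram - sum(t.ram for t in self.tasks))


def feasible(m: Machine, task: RunningTask) -> bool:
    cpu, ram = m.available(task.priority)
    return task.cpu <= cpu and task.ram <= ram


def score(m: Machine, task: RunningTask) -> Tuple[int, float]:
    """Toy score, higher is better: avoid preemption, then fit tightly."""
    free_cpu, free_ram = m.free()
    needs_preemption = task.cpu > free_cpu or task.ram > free_ram
    cpu, ram = m.available(task.priority)
    leftover = (cpu - task.cpu) / m.cpu + (ram - task.ram) / m.ram
    return (0 if needs_preemption else 1, -leftover)


def place(task: RunningTask, machines: List[Machine]) -> Optional[Machine]:
    candidates = [m for m in machines if feasible(m, task)]
    return max(candidates, key=lambda m: score(m, task), default=None)


machines = [
    Machine("a", 8, 16, [RunningTask(100, 6, 12)]),  # fits only by evicting
    Machine("b", 8, 16, [RunningTask(300, 4, 8)]),   # fits in free capacity
    Machine("c", 4, 8, [RunningTask(300, 3, 6)]),    # does not fit
]
chosen = place(RunningTask(200, 3, 6), machines)
```

Here machine "b" wins because the task fits without evicting anything. A larger task at the same priority would fall back to machine "a", where a lower-priority task can be preempted.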

3.3 Borglet

The Borglet is the local agent on every machine in a cell. It starts and stops tasks, restarts them when they fail, manages local resources, and reports the machine's state to the Borgmaster. The Borgmaster polls each Borglet every few seconds to retrieve that state and to send any outstanding requests. Because the master sets the pace of communication, no explicit flow control is needed and recovery storms are avoided.

3.4 Scalability

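A single Borgmaster can manage many thousands of machines in a cell, and several cells see arrival rates above 10 000 tasks per minute. Three techniques keep the scheduler fast: score caching (reusing a computed score until the machine or the task changes), equivalence classes (running feasibility and scoring once per group of tasks with identical requirements), and relaxed randomization (examining machines in random order until enough feasible candidates are found, rather than scoring every machine). With these techniques, scheduling a cell's entire workload from scratch typically takes a few hundred seconds, whereas it did not finish after more than three days with them disabled. A normal online pass over the pending queue completes in under half a second.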
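The three techniques can be illustrated with a small Python sketch. It assumes tasks are plain dictionaries with "cpu", "ram", "priority", and "constraints" keys; those keys, the ScoreCache class, and the function names are hypothetical and only show the shape of each idea.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple


def equivalence_key(task: dict) -> Tuple:
    """Tasks with identical requirements fall into the same class."""
    return (task["cpu"], task["ram"], task["priority"],
            tuple(sorted(task["constraints"])))


def group_by_class(tasks: List[dict]) -> Dict[Tuple, List[dict]]:
    classes = defaultdict(list)
    for t in tasks:
        classes[equivalence_key(t)].append(t)
    return classes


class ScoreCache:
    """Reuse a (class, machine) score until that machine changes."""

    def __init__(self, score_fn: Callable[[Hashable, str], float]) -> None:
        self._score_fn = score_fn
        self._cache: Dict[Tuple[Hashable, str], float] = {}
        self.misses = 0

    def score(self, class_key: Hashable, machine: str) -> float:
        key = (class_key, machine)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._score_fn(class_key, machine)
        return self._cache[key]

    def invalidate_machine(self, machine: str) -> None:
        self._cache = {k: v for k, v in self._cache.items() if k[1] != machine}


def sample_feasible(machines, is_feasible, want: int, rng: random.Random):
    """Relaxed randomization: scan machines in random order and stop as
    soon as enough feasible ones have been found."""
    order = list(machines)
    rng.shuffle(order)
    found, examined = [], 0
    for m in order:
        examined += 1
        if is_feasible(m):
            found.append(m)
            if len(found) >= want:
                break
    return found, examined
```

Grouping a thousand identical tasks yields a single class, so feasibility and scoring run once for that class instead of a thousand times. The sampler usually stops well before it has examined every machine.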

4. Availability

Failures are expected in large systems. Borg limits their impact by automatically rescheduling evicted tasks, spreading a job's tasks across failure domains (machines, racks, power domains), and limiting both the rate of task disruptions and the number of a job's tasks that can be down at once during maintenance. Tasks that are already running keep running even if the Borgmaster or a task's Borglet goes down.

Several design choices let the Borgmaster itself reach 99.99% availability in practice: replicated state to survive machine failures, admission control to avoid overload, and minimal external dependencies. Each cell is independent of the others, which keeps failures and operator errors from propagating across cells.
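Two of the mitigations above, spreading tasks across fault domains and capping concurrent interruptions, can be sketched in Python. Both functions are hypothetical simplifications: racks are plain strings, and the disruption cap is a fixed count per job.

```python
from collections import Counter
from typing import Iterable, List


def pick_rack(job_task_racks: Iterable[str], candidate_racks: List[str]) -> str:
    """Choose the candidate rack that currently hosts the fewest tasks of
    this job, so that one rack failure takes out as few tasks as possible.
    Ties are broken by rack name to keep the choice deterministic."""
    counts = Counter(job_task_racks)
    return min(candidate_racks, key=lambda rack: (counts[rack], rack))


def may_disrupt(currently_down: int, max_down: int) -> bool:
    """Allow one more planned interruption only while the job is below
    its cap on simultaneously unavailable tasks."""
    return currently_down < max_down


# The job already has two tasks on r1 and one on r2, so r3 is preferred.
next_rack = pick_rack(["r1", "r1", "r2"], ["r1", "r2", "r3"])
```

A maintenance tool built on may_disrupt would drain machines only while the check passes, then wait for rescheduled tasks to come back before continuing.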

Borg high‑level architecture diagram
Job and task state diagram
Task eviction rates for production vs. non‑production workloads
Compression effectiveness CDF