Understanding Kubernetes’ Core Design: Declarative Data, List‑Watch, and Level Trigger
This article explains Kubernetes’ fundamental design principles—including declarative state, level‑trigger messaging, and the list‑watch mechanism—by walking through a ReplicationController‑based runtime flow, detailing each component’s role, and analyzing the reliability, ordering, and real‑time requirements of the watch system.
1. Introduction
Kubernetes has become a focal point in the container‑cloud ecosystem. The author, an Alibaba technical expert, begins a series on Kubernetes design, focusing first on its overall runtime flow and the list‑watch mechanism that embody the platform’s core philosophies.
2. Core Design Principles
System operation is based on declarative data rather than imperative commands.
Inter‑component messaging uses Level Trigger instead of Edge Trigger.
Runtime state is controlled by various closed‑loop controllers.
Extensibility is achieved through abstract interfaces such as CRI, CNI, CSI, the scheduler, and admission controllers.
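The difference between level trigger and edge trigger in the principles above can be sketched in a few lines of Go. This is a minimal illustration, not Kubernetes code: the State type and the reconcile and applyEdges functions are hypothetical names introduced here.

```go
package main

import "fmt"

// State is a hypothetical snapshot: what the user declared and what currently exists.
type State struct {
	DesiredReplicas int
	ActualReplicas  int
}

// reconcile is level-triggered: it looks only at the current snapshot and
// returns how many Pods to create (positive) or delete (negative).
func reconcile(s State) int {
	return s.DesiredReplicas - s.ActualReplicas
}

// applyEdges is edge-triggered: it adds up the change events it received.
// Any event lost in transit leaves its result permanently wrong.
func applyEdges(start int, deltas []int) int {
	for _, d := range deltas {
		start += d
	}
	return start
}

func main() {
	// Two Pods out of three died, but only one "-1" event was delivered.
	believed := applyEdges(3, []int{-1})
	fmt.Println("edge-triggered view of running replicas:", believed)

	// The level-triggered controller reads the real state (1 running) and
	// computes the full correction, regardless of which events it missed.
	s := State{DesiredReplicas: 3, ActualReplicas: 1}
	fmt.Println("level-triggered correction:", reconcile(s))
}
```

The closed-loop controllers named above follow the second pattern: each pass compares declared and observed state, so a missed notification delays convergence but does not corrupt it.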
3. Overall Runtime Flow Using a ReplicationController
The article illustrates the full lifecycle of a Pod by using a ReplicationController (RC) as the entry point.
RC Creation (Process 1) : A user creates an RC via the kube-apiserver REST API, e.g., using
curl -XPOST -d "v1.ReplicationController" -H "Content-Type: application/json" http://ip:port/api/v1/namespaces/{namespace}/replicationcontrollers
where the -d argument stands for the JSON-serialized v1.ReplicationController object. The submitted object contains declarative specifications such as replica count and container image.
Pod Creation (Process 2‑3) : kube-controller-manager watches the newly created RC, then creates the required Pods to match the desired replica count.
Pod Scheduling (Process 4‑5) : kube-scheduler watches the fresh Pods, selects suitable nodes, and updates each Pod’s spec.nodeName field.
Pod Execution (Process 6‑7) : On the chosen nodes, kubelet watches the scheduled Pods, invokes the container runtime to start the containers, and reports the Pod status to kube-apiserver, which persists it in etcd.
Pod Status Transitions : The Pod moves from Pending (after creation) to Running (after scheduling and container start), with conditions such as PodScheduled=true, PodInitialized=true, and PodReady=true reflecting each stage.
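The flow above can be sketched as three independent loops acting on shared state. This is a toy simulation under stated assumptions: the RC and Pod types are simplified stand-ins rather than the real v1 API types, the function names are hypothetical, and round-robin placement replaces the real scheduling logic.

```go
package main

import "fmt"

// RC and Pod are toy stand-ins; the fields are illustrative, not the real v1 types.
type RC struct {
	Name     string
	Replicas int
	Image    string
}

type Pod struct {
	Name     string
	Image    string
	NodeName string // empty until scheduled
	Phase    string // "Pending" or "Running"
}

// controllerSync plays kube-controller-manager: create Pods until the count matches the RC.
func controllerSync(rc RC, pods []*Pod) []*Pod {
	for len(pods) < rc.Replicas {
		name := fmt.Sprintf("%s-%d", rc.Name, len(pods))
		pods = append(pods, &Pod{Name: name, Image: rc.Image, Phase: "Pending"})
	}
	return pods
}

// schedule plays kube-scheduler: bind every unscheduled Pod to a node (round-robin here).
func schedule(pods []*Pod, nodes []string) {
	next := 0
	for _, p := range pods {
		if p.NodeName == "" {
			p.NodeName = nodes[next%len(nodes)]
			next++
		}
	}
}

// kubeletSync plays the kubelet on one node: start Pods bound to it and report Running.
func kubeletSync(node string, pods []*Pod) {
	for _, p := range pods {
		if p.NodeName == node && p.Phase == "Pending" {
			p.Phase = "Running"
		}
	}
}

func main() {
	rc := RC{Name: "web", Replicas: 3, Image: "nginx"}
	nodes := []string{"node-a", "node-b"}

	pods := controllerSync(rc, nil) // Process 2-3
	schedule(pods, nodes)           // Process 4-5
	for _, n := range nodes {       // Process 6-7
		kubeletSync(n, pods)
	}
	for _, p := range pods {
		fmt.Printf("%s on %s: %s\n", p.Name, p.NodeName, p.Phase)
	}
}
```

None of the three functions calls another. Each one only inspects the current objects and fills in the part it owns, which mirrors how the real components coordinate through kube-apiserver instead of invoking each other directly.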
4. List‑Watch Mechanism
Kubernetes adopts a Level Trigger model: a component only needs to see the latest state, not every intermediate change. The list‑watch mechanism that delivers this state must satisfy three requirements: real‑time delivery, ordered delivery, and reliable recovery from failures.
4.1 Real‑time Delivery (Requirement 1)
HTTP long polling – the client sends a request that the server holds open until data is available; after each response the client issues a new request. Drawback: a full request‑response round trip for every batch of data.
HTTP streaming – the client opens a single request; the server sends a chunked response as events occur. Drawback: client must handle custom streaming format.
Kubernetes chooses HTTP streaming, calling it a watch request.
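The streaming approach can be tried locally with nothing but the Go standard library. The sketch below is an assumption-laden illustration, not the kube-apiserver implementation: the Event type is a simplified stand-in for a watch event, and streamEvents, watch, and runDemo are hypothetical helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// Event is a hypothetical, simplified watch event (not the real wire type).
type Event struct {
	Type string `json:"type"`
	Name string `json:"name"`
}

// streamEvents writes one JSON document per event and flushes after each,
// so the client sees every event as soon as it is produced.
func streamEvents(events []Event) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		enc := json.NewEncoder(w)
		for _, ev := range events {
			if err := enc.Encode(ev); err != nil {
				return
			}
			flusher.Flush()
		}
	}
}

// watch opens a single request and decodes events until the server closes the stream.
func watch(url string) ([]Event, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var got []Event
	dec := json.NewDecoder(resp.Body)
	for {
		var ev Event
		if err := dec.Decode(&ev); err == io.EOF {
			return got, nil
		} else if err != nil {
			return got, err
		}
		got = append(got, ev)
	}
}

// runDemo starts a throwaway local server and watches it once.
func runDemo() ([]Event, error) {
	srv := httptest.NewServer(streamEvents([]Event{
		{Type: "ADDED", Name: "pod-a"},
		{Type: "MODIFIED", Name: "pod-a"},
	}))
	defer srv.Close()
	return watch(srv.URL)
}

func main() {
	events, err := runDemo()
	if err != nil {
		panic(err)
	}
	for _, ev := range events {
		fmt.Println(ev.Type, ev.Name)
	}
}
```

The client pays for one request and then reads a sequence of self-delimiting JSON documents, which is the "custom streaming format" cost mentioned above: it cannot wait for a complete response body before parsing.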
4.2 Ordered Delivery (Requirement 2)
Every REST object carries a ResourceVersion field, monotonically increased by etcd. A watch request includes the latest known version; the server then streams objects with greater versions, guaranteeing order.
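A toy model may help make the ordering guarantee concrete. It assumes an append-only change log with integer revisions; Change, Log, Append, and Since are hypothetical names. Real ResourceVersion values are strings that clients should treat as opaque, so the numeric comparison below belongs to the server side of this sketch only.

```go
package main

import "fmt"

// Change is a hypothetical record of one object change, stamped with the store revision.
type Change struct {
	Revision int64
	Key      string
}

// Log is a toy append-only change log standing in for the server side of a watch.
type Log struct {
	rev     int64
	changes []Change
}

// Append records a change under the next revision, so revisions only increase.
func (l *Log) Append(key string) int64 {
	l.rev++
	l.changes = append(l.changes, Change{Revision: l.rev, Key: key})
	return l.rev
}

// Since returns, in revision order, every change newer than the caller's last seen revision.
func (l *Log) Since(lastSeen int64) []Change {
	var out []Change
	for _, c := range l.changes {
		if c.Revision > lastSeen {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	var l Log
	l.Append("pod-a")
	seen := l.Append("pod-b") // the client has processed everything up to here
	l.Append("pod-a")
	l.Append("pod-c")

	// A watch that starts from "seen" replays only the later changes, oldest first.
	for _, c := range l.Since(seen) {
		fmt.Println(c.Revision, c.Key)
	}
}
```

Because the log is appended in revision order and filtered by a single threshold, the client can neither receive a change twice nor see a later change before an earlier one.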
type ObjectMeta struct {
	Name string
	// ...
	ResourceVersion string // ensures ordering
	// ...
}
4.3 Reliable Delivery (Requirement 3)
Before starting a watch, a client performs a list request to obtain the current full set of objects and the latest ResourceVersion. If a watch fails (e.g., network partition), the client relists and resumes watching from the newest version. The implementation resides in
kubernetes/vendor/k8s.io/client-go/tools/cache/reflector.go#ListAndWatch().
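The list-then-watch recovery pattern described above can be sketched with an in-memory stand-in for the server. This is not the reflector.go code: Source, Client, Relist, and Step are hypothetical names, revisions are plain integers, and the failure is injected by a flag.

```go
package main

import (
	"errors"
	"fmt"
)

// Source is a toy stand-in for the API server; all names here are hypothetical.
type Source struct {
	objects   map[string]int64 // object name -> revision of its last change
	rev       int64
	failWatch bool // when set, the next Watch call fails once
}

// Put creates or updates an object under the next revision.
func (s *Source) Put(name string) {
	s.rev++
	s.objects[name] = s.rev
}

// List returns a full snapshot and the revision that snapshot reflects.
func (s *Source) List() (map[string]int64, int64) {
	snap := make(map[string]int64, len(s.objects))
	for name, r := range s.objects {
		snap[name] = r
	}
	return snap, s.rev
}

// Watch returns the objects changed after fromRev, or an error that
// simulates a broken connection.
func (s *Source) Watch(fromRev int64) (map[string]int64, int64, error) {
	if s.failWatch {
		s.failWatch = false
		return nil, fromRev, errors.New("connection lost")
	}
	changed := map[string]int64{}
	for name, r := range s.objects {
		if r > fromRev {
			changed[name] = r
		}
	}
	return changed, s.rev, nil
}

// Client keeps a local cache and the last revision it has seen.
type Client struct {
	cache map[string]int64
	rev   int64
	lists int // number of full list calls, the expensive operation
}

// Relist replaces the cache with a fresh snapshot.
func (c *Client) Relist(s *Source) {
	c.cache, c.rev = s.List()
	c.lists++
}

// Step runs one watch round and falls back to a relist if the watch fails.
func (c *Client) Step(s *Source) {
	changed, rev, err := s.Watch(c.rev)
	if err != nil {
		c.Relist(s)
		return
	}
	for name, r := range changed {
		c.cache[name] = r
	}
	c.rev = rev
}

func main() {
	s := &Source{objects: map[string]int64{}}
	s.Put("pod-a")

	c := &Client{}
	c.Relist(s) // initial list

	s.Put("pod-b")
	s.failWatch = true
	c.Step(s) // watch fails, relist picks up pod-b

	s.Put("pod-a")
	c.Step(s) // watch resumes from the relisted revision

	fmt.Println(len(c.cache), c.rev, c.lists) // prints: 2 3 2
}
```

The change to pod-b happened while the watch was broken, yet the client's cache still ends up correct, because the relist returns the current level of every object instead of the events that were missed.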
5. Reflections on List‑Watch
Full list operations can be costly on large clusters (e.g., hundreds of thousands of Pods), so minimizing relist frequency is important.
HTTP/1.1 needs a separate TCP connection for each watch, and a silent network failure may go unnoticed until a keep‑alive timeout fires. HTTP/2 multiplexes many watch streams over one TCP connection, which reduces server load but changes failure‑detection semantics.
Watch requests include a timeoutSeconds parameter (typically 5–10 minutes) so that the server closes long‑lived connections after a bounded time and avoids resource leaks.
The level‑trigger design eliminates the need for external message queues (e.g., RabbitMQ), simplifying the overall architecture.
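To make the timeoutSeconds point above concrete, the sketch below builds a watch URL for the ReplicationController endpoint shown in section 3. The watchURL helper, its parameters, and the example host are assumptions made for illustration; watch, resourceVersion, and timeoutSeconds are the query parameter names used by the Kubernetes API. The jitter factor is one plausible way to land in the 5–10 minute range so that many clients do not reconnect at the same moment.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// watchURL builds a watch request URL for ReplicationControllers in one namespace.
// jitter is expected in [0,1) and spreads the timeout between minSeconds and 2*minSeconds.
func watchURL(base, namespace, resourceVersion string, minSeconds int64, jitter float64) string {
	timeout := int64(float64(minSeconds) * (1.0 + jitter))

	q := url.Values{}
	q.Set("watch", "true")
	q.Set("resourceVersion", resourceVersion)
	q.Set("timeoutSeconds", strconv.FormatInt(timeout, 10))

	// Encode sorts the parameters by key, so the output is deterministic.
	return base + "/api/v1/namespaces/" + url.PathEscape(namespace) +
		"/replicationcontrollers?" + q.Encode()
}

func main() {
	// Hypothetical server address and resource version; 300 s minimum with 0.5 jitter gives 450 s.
	fmt.Println(watchURL("http://127.0.0.1:8080", "default", "12345", 300, 0.5))
}
```

When the server closes the stream at the timeout, the client can simply open a new watch from the last ResourceVersion it saw, which is far cheaper than a full relist.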
Alibaba Cloud Native
