Mastering Kubernetes Operators: From Concepts to Real-World Implementation
This article explains what Kubernetes Operators are, why they are useful, and provides a detailed walkthrough of building an etcd‑cluster Operator with Go, covering CRDs, reconciliation loops, controller features, permissions, validation, testing, and best‑practice recommendations.
Key points:
The Kubernetes API provides a single integration point for all cloud resources, which promotes cloud-native adoption.
Several frameworks and libraries simplify Operator development; the Go ecosystem is the most mature.
Operators can manage third-party software such as databases, letting DevOps teams automate external products.
The hard part is not writing the Operator itself but understanding what actions it needs to take.
Operators have been a core part of the Kubernetes ecosystem for years, moving the management interface into the Kubernetes API and offering a “single pane of glass” experience. They are attractive for developers who want to simplify native apps or for DevOps engineers reducing system complexity. But how do you build an Operator from scratch?
Deep dive into Operators
What is an Operator?
Operators are now widespread. Databases, cloud-native projects, and many other complex workloads running on Kubernetes ship with one. CoreOS introduced the concept in 2016 as a way of moving operational concerns into software. An Operator can automatically deploy a database instance, upgrade versions, or perform backups, often faster than a human.
Operators extend the Kubernetes API via Custom Resource Definitions (CRDs), turning the API into a “single pane of glass”. DevOps engineers can leverage the rich ecosystem of tools built around the API to manage and monitor their applications, for example:
Use built-in RBAC to control who may read or change custom resources.
Apply GitOps for reproducible deployments and code review.
Enforce policies on custom resources with OPA.
Simplify deployment descriptions with Helm, Kustomize, ksonnet, or Terraform.
This approach keeps production, testing, and development environments consistent, provided each of them is a Kubernetes cluster.
Why use an Operator?
Operators are built either by development teams, as a custom Operator for their own product, or by DevOps teams to automate the management of third-party software. If an Operator only deploys resources, kubectl apply achieves the same result. Operators add value when they go further, for example by scaling databases, handling configuration differences, or performing automated backup and restore, metrics integration, fault detection, and auto-tuning.
Operators can also manage resources outside of Kubernetes, for example cloud provider services like AWS S3 or Azure storage, by exposing them through the Kubernetes API.
Operator example
This article focuses on etcd-cluster-operator, which manages an etcd cluster inside Kubernetes. etcd is a distributed key-value store in which each instance needs its own failure domain, a unique network name, and a way to discover its peers. Operating it also involves the following:
Cluster growth or shrinkage is performed via the etcd management API.
Backups are taken through a “snapshot” endpoint using gRPC.
Restores use etcdctl and coordinate actions inside Kubernetes.
Operator anatomy
An Operator consists of one or more CRDs that define new resource types (e.g., EtcdCluster and EtcdPeer) and a controller that watches those resources and reconciles the desired state.
Operators are typically deployed as a Deployment in a dedicated Namespace, with a container image, ServiceAccount, RBAC bindings, and optional webhook configuration.
Software and tools
An Operator can be written in any language that can make HTTP calls, but Go offers the most mature tooling. The controller-runtime library, Kubebuilder, and Operator SDK streamline development. Java, Rust, Python, and other languages have libraries of varying maturity.
Custom resources and desired state
For the etcd Operator, a custom resource EtcdCluster is defined. Example manifest:
apiVersion: etcd.improbable.io/v1alpha1
kind: EtcdCluster
metadata:
  name: my-first-etcd-cluster
spec:
  replicas: 3
  version: 3.2.28
The spec describes the desired state; the controller updates the actual resources to match it. The status field reports the current state.
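With Kubebuilder, the spec is written as a Go struct, and the CRD manifest is generated from it:
type EtcdClusterSpec struct {
	// Version is the etcd version to run, for example "3.2.28".
	Version string `json:"version"`
	// Replicas is the desired number of etcd peers.
	Replicas *int32 `json:"replicas"`
	// Storage optionally describes the storage for each peer.
	Storage *EtcdPeerStorage `json:"storage,omitempty"`
	// PodTemplate optionally customizes the pods that run the peers.
	PodTemplate *EtcdPodTemplateSpec `json:"podTemplate,omitempty"`
}
Reconciliation loop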
The controller follows a three‑step loop: observe the desired state, observe the current state of managed resources, and act to make the actual state match the desired state. It may create Deployments, Services, PVCs, etc., and interact directly with etcd via its management API.
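The three steps can be sketched in plain Go. This is a minimal illustration that uses an in-memory stand-in for the cluster, not the etcd-cluster-operator's actual code. The names desiredState, actualState, and reconcile are hypothetical, and the sketch takes one action per call and asks to be requeued until nothing is left to do.

```go
package main

import "fmt"

// desiredState holds the part of the EtcdCluster spec used in this sketch.
type desiredState struct {
	Replicas int
}

// actualState is a hypothetical in-memory stand-in for what a controller
// would observe in the cluster: the names of the peers that exist.
type actualState struct {
	Peers []string
}

// reconcile compares desired and actual state and performs at most one
// action per call. It reports what it did and whether the caller should
// run it again.
func reconcile(name string, want desiredState, have *actualState) (action string, requeue bool) {
	switch {
	case len(have.Peers) < want.Replicas:
		// Too few peers: add exactly one, then ask to be called again.
		peer := fmt.Sprintf("%s-%d", name, len(have.Peers))
		have.Peers = append(have.Peers, peer)
		return "created " + peer, true
	case len(have.Peers) > want.Replicas:
		// Too many peers: remove the most recently added one.
		last := have.Peers[len(have.Peers)-1]
		have.Peers = have.Peers[:len(have.Peers)-1]
		return "removed " + last, true
	default:
		// Actual state already matches desired state.
		return "nothing to do", false
	}
}

func main() {
	want := desiredState{Replicas: 3}
	have := &actualState{}
	for {
		action, requeue := reconcile("my-first-etcd-cluster", want, have)
		fmt.Println(action)
		if !requeue {
			break
		}
	}
}
```

Because each call re-reads the state before acting, the loop converges from any starting point, including one left behind by a crash halfway through an earlier run.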
Controller features
Rather than polling on a fixed interval (every 30 seconds, say), a controller can watch the Kubernetes API, cache the objects it reads, and batch updates, which reduces both API server load and reaction latency.
API watch
Watching registers interest in specific resources and receives notifications on changes, allowing the Operator to react immediately and stay idle otherwise.
API cache
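Caching reduces API server load, but the cache can be stale. The controller must tolerate outdated reads, for example an "already exists" error when it tries to create a resource that the cache has not yet shown.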
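One way to cope with a stale cache is to treat "already exists" as success when creating a resource. The sketch below assumes a hypothetical fakeAPI type and errAlreadyExists error in place of a real API client, so it shows only the shape of the check.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a hypothetical stand-in for the API server's
// "already exists" error.
var errAlreadyExists = errors.New("already exists")

// fakeAPI is a hypothetical in-memory stand-in for the API server.
type fakeAPI struct {
	objects map[string]bool
}

// Create stores name, or fails if an object with that name is present.
func (a *fakeAPI) Create(name string) error {
	if a.objects[name] {
		return fmt.Errorf("create %q: %w", name, errAlreadyExists)
	}
	a.objects[name] = true
	return nil
}

// ensureExists creates name unless the (possibly stale) cache lists it.
// An "already exists" error means the cache was behind, which is fine.
func ensureExists(api *fakeAPI, cache map[string]bool, name string) error {
	if cache[name] {
		return nil
	}
	err := api.Create(name)
	if errors.Is(err, errAlreadyExists) {
		return nil
	}
	return err
}

func main() {
	api := &fakeAPI{objects: map[string]bool{"peer-0": true}}
	staleCache := map[string]bool{} // has not yet seen peer-0

	// Both calls print <nil>: the first hits "already exists", the second creates.
	fmt.Println(ensureExists(api, staleCache, "peer-0"))
	fmt.Println(ensureExists(api, staleCache, "peer-1"))
}
```

In a real controller the equivalent check is IsAlreadyExists from k8s.io/apimachinery/pkg/api/errors.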
Batch updates
When many resources change simultaneously, batching prevents the controller from performing redundant work.
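The batching idea can be illustrated with a small de-duplicating queue: a key that is already waiting is not queued again, so a burst of notifications for one resource leads to a single reconciliation. The dedupQueue type below is a hypothetical, single-goroutine simplification. The work queues used by Go controllers behave similarly but also handle concurrency and rate limiting.

```go
package main

import "fmt"

// dedupQueue is a simplified, non-concurrent sketch of a work queue that
// collapses repeated additions of the same key.
type dedupQueue struct {
	pending map[string]bool // keys currently waiting in the queue
	order   []string        // FIFO order of waiting keys
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: make(map[string]bool)}
}

// Add enqueues key unless it is already waiting.
func (q *dedupQueue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get removes and returns the oldest key, or false if the queue is empty.
func (q *dedupQueue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

func main() {
	q := newDedupQueue()
	// Five change notifications arrive for two clusters.
	for _, key := range []string{"default/a", "default/a", "default/b", "default/a", "default/b"} {
		q.Add(key)
	}
	// Only two reconciliations are needed.
	for {
		key, ok := q.Get()
		if !ok {
			break
		}
		fmt.Println("reconcile", key)
	}
}
```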
Permissions
Operators need a ServiceAccount with the minimum RBAC permissions required to get, list, watch, create, update, and delete the resources they manage. With Kubebuilder, these are declared as RBAC markers in the controller source, from which the Role manifests are generated:
//+kubebuilder:rbac:groups=etcd.improbable.io,resources=etcdpeers,verbs=get;list;watch
//+kubebuilder:rbac:groups=etcd.improbable.io,resources=etcdpeers/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=apps,resources=replicasets,verbs=list;get;create;watch
//+kubebuilder:rbac:groups=core,resources=persistentvolumeclaims,verbs=list;get;create;watch;delete
Validation and defaults
CRDs provide basic validation, but complex checks and defaulting are often implemented in the Operator itself or in validating and mutating admission webhooks.
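A minimal sketch of defaulting and validation as plain functions, which could be called from both a webhook and the controller. The default of three replicas and the specific rules are assumptions made for illustration, and the struct is trimmed to the two fields used here.

```go
package main

import (
	"errors"
	"fmt"
)

// EtcdClusterSpec is a trimmed copy of the spec with only the fields
// this sketch needs.
type EtcdClusterSpec struct {
	Version  string
	Replicas *int32
}

// applyDefaults fills in unset fields. The default of 3 replicas is an
// assumption for this example.
func applyDefaults(spec *EtcdClusterSpec) {
	if spec.Replicas == nil {
		n := int32(3)
		spec.Replicas = &n
	}
}

// validate applies checks that go beyond simple schema validation.
// The rules here are illustrative, not the operator's real rules.
func validate(spec EtcdClusterSpec) error {
	if spec.Version == "" {
		return errors.New("version must be set")
	}
	if spec.Replicas != nil && *spec.Replicas < 1 {
		return fmt.Errorf("replicas must be at least 1, got %d", *spec.Replicas)
	}
	return nil
}

func main() {
	spec := EtcdClusterSpec{Version: "3.2.28"}
	applyDefaults(&spec)
	if err := validate(spec); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Printf("version=%s replicas=%d\n", spec.Version, *spec.Replicas)
}
```

Keeping the logic in ordinary functions means the controller still behaves sensibly if the webhook is not installed.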
Testing
Unit tests can cover individual logic units, while integration tests benefit from a real Kubernetes cluster. Tools like kind (Kubernetes in Docker) enable fast, realistic integration testing.
Conclusion
Deploy Operators as Pods in the cluster.
Go offers the most mature ecosystem, though any language can be used.
Handle non‑Kubernetes resources carefully, especially during network failures.
Perform a single action per reconciliation cycle and re‑queue as needed.
Adopt a "level-based" approach rather than an "edge-based" one: act on the state as currently observed, not on individual change events.
Use deterministic naming for generated resources.
Grant the ServiceAccount the least privileges required.
Apply defaults both in webhooks and in code.
Use kind for reliable integration testing.
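As an illustration of the deterministic naming recommendation, the sketch below derives peer names only from the cluster name and an index. The naming scheme and the peerNames function are hypothetical. What matters is that a repeated reconciliation computes the same names, so a retried create collides with the existing resource instead of producing a duplicate.

```go
package main

import "fmt"

// peerNames returns the names of the peers for a cluster. The names depend
// only on the inputs, so every reconciliation computes the same list.
func peerNames(cluster string, replicas int) []string {
	names := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		names = append(names, fmt.Sprintf("%s-%d", cluster, i))
	}
	return names
}

func main() {
	// Prints [my-first-etcd-cluster-0 my-first-etcd-cluster-1 my-first-etcd-cluster-2]
	fmt.Println(peerNames("my-first-etcd-cluster", 3))
}
```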
