
Mastering Kubernetes Operators: From Concepts to Real-World Implementation

This article explains what Kubernetes Operators are, why they are useful, and provides a detailed walkthrough of building an etcd‑cluster Operator with Go, covering CRDs, reconciliation loops, controller features, permissions, validation, testing, and best‑practice recommendations.

MaGe Linux Operations

Key points:

The Kubernetes API provides a single integration point for all cloud resources, promoting cloud‑native adoption.

Several frameworks and libraries simplify Operator development; the Go ecosystem is the most mature.

Operators can manage third‑party software such as databases, enabling DevOps teams to automate external products.

The hard part is working out which actions the Operator must take, not writing the Operator itself.

Operators have been a core part of the Kubernetes ecosystem for years, moving the management interface into the Kubernetes API and offering a “single pane of glass” experience. They appeal to developers who want to simplify running their Kubernetes‑native applications and to DevOps engineers who want to reduce system complexity. But how do you build an Operator from scratch?

Deep dive into Operators

What is an Operator?

Operators are now ubiquitous. Databases, cloud‑native projects, and many other workloads that are complex to deploy or maintain on Kubernetes ship with one. CoreOS introduced the concept in 2016 as a way to move operational knowledge into software. An Operator can automatically deploy a database instance, upgrade versions, or perform backups, often faster and more consistently than a human.

Operators extend the Kubernetes API via Custom Resource Definitions (CRDs), turning the API into a “single pane of glass”. DevOps engineers can leverage the rich ecosystem of tools built around the API to manage and monitor their applications, for example:

Use Kubernetes' built‑in authentication and RBAC authorization to control who can read or change custom resources.

Apply GitOps for reproducible deployments and code review.

Enforce policies on custom resources with OPA.

Simplify deployment descriptions with Helm, Kustomize, ksonnet, or Terraform.

This approach keeps production, testing, and development environments consistent, provided each environment is a Kubernetes cluster.

Why use an Operator?

Operators are written either by development teams, as a custom Operator for their own product, or by DevOps teams, to automate management of third‑party software. An Operator that only deploys resources adds little over kubectl apply; more complex Operators add value by scaling databases, handling configuration differences, or performing automated backup, restore, metrics integration, fault detection, and auto‑tuning.

Operators can also manage resources outside of Kubernetes, for example cloud provider services like AWS S3 or Azure Storage, by exposing them through the Kubernetes API.

Operator example

This article focuses on an etcd‑cluster‑operator used to manage an etcd cluster inside Kubernetes. etcd is a distributed key‑value store; to run reliably, each instance needs its own failure domain, a unique network name, and a way to discover its peers. Beyond deployment, the Operator has to handle the following:

Cluster growth or shrinkage is performed via the etcd management API.

Backups are taken through a “snapshot” endpoint using gRPC.

Restores use etcdctl and coordinate actions inside Kubernetes.

Operator anatomy

An Operator consists of one or more CRDs that define new resource types (e.g., EtcdCluster and EtcdPeer) and a controller that watches those resources and reconciles the desired state.

Operators are typically deployed as a Deployment in a dedicated Namespace, with a container image, ServiceAccount, RBAC bindings, and optional webhook configuration.

Software and tools

An Operator can be written in any language that can make HTTP calls, but Go offers the most mature tooling. The controller‑runtime library, Kubebuilder, and Operator SDK streamline development. Other languages such as Java, Rust, and Python have libraries of varying maturity.

Custom resources and desired state

For the etcd Operator, a custom resource EtcdCluster is defined. Example manifest:

apiVersion: etcd.improbable.io/v1alpha1
kind: EtcdCluster
metadata:
  name: my-first-etcd-cluster
spec:
  replicas: 3
  version: 3.2.28

The spec describes the desired state; the controller updates the actual resources to match it. The status field reports the current state.

With Kubebuilder, the spec is written as a Go struct, and the CRD manifest is generated from it:

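type EtcdClusterSpec struct {
    // Version is the etcd version to run (required).
    Version     string               `json:"version"`
    // Replicas is the desired number of etcd peers.
    Replicas    *int32               `json:"replicas"`
    // Storage and PodTemplate are optional and omitted from JSON when unset.
    Storage     *EtcdPeerStorage     `json:"storage,omitempty"`
    PodTemplate *EtcdPodTemplateSpec `json:"podTemplate,omitempty"`
}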

Reconciliation loop

The controller follows a three‑step loop: observe the desired state, observe the current state of managed resources, and act to make the actual state match the desired state. It may create Deployments, Services, PVCs, etc., and interact directly with etcd via its management API.
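The three steps above can be sketched without any Kubernetes dependency. This is a minimal illustration, not the etcd Operator's actual code: the Action type, reconcile, and apply are hypothetical names, and the "cluster" is reduced to a replica count. Each pass decides on at most one action, and the caller re-queues until nothing is left to do.

```go
package main

import "fmt"

// Action is the single step a reconcile pass decides to take (hypothetical type).
type Action string

const (
	ActionNone       Action = "none"
	ActionAddPeer    Action = "add-peer"
	ActionRemovePeer Action = "remove-peer"
)

// reconcile compares the desired and observed replica counts and returns at
// most one action; the caller re-queues until it returns ActionNone.
func reconcile(desired, observed int) Action {
	switch {
	case observed < desired:
		return ActionAddPeer
	case observed > desired:
		return ActionRemovePeer
	default:
		return ActionNone
	}
}

// apply stands in for the real side effect (creating a resource, calling the
// etcd management API, ...) and returns the new observed count.
func apply(a Action, observed int) int {
	switch a {
	case ActionAddPeer:
		return observed + 1
	case ActionRemovePeer:
		return observed - 1
	}
	return observed
}

func main() {
	desired, observed := 3, 0
	for {
		a := reconcile(desired, observed)
		if a == ActionNone {
			break
		}
		observed = apply(a, observed)
		fmt.Printf("action=%s observed=%d\n", a, observed)
	}
}
```

Because reconcile looks only at the current state, it gives the same answer no matter which event triggered it or how many events were missed.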

Controller features

Instead of polling every 30 seconds, controllers can watch the Kubernetes API, read from a local cache, and batch updates, which reduces both API server load and reaction latency.

API watch

Watching registers interest in specific resources and receives notifications on changes, allowing the Operator to react immediately and stay idle otherwise.
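A rough sketch of the watch pattern, assuming a channel stands in for the API server's watch stream. The event struct, keyFor, and run are hypothetical names for illustration; a real controller would typically get this behavior from controller‑runtime or client‑go rather than write it by hand.

```go
package main

import "fmt"

// event is a simplified stand-in for a watch notification.
type event struct {
	Type      string // "ADDED", "MODIFIED" or "DELETED"
	Namespace string
	Name      string
}

// keyFor maps a notification to the key of the object to reconcile.
// The event type is deliberately dropped: the reconciler re-reads the
// current state instead of trusting what the event said.
func keyFor(ev event) string {
	return ev.Namespace + "/" + ev.Name
}

// run blocks on the watch channel, so the controller stays idle until
// something changes, and returns when the channel is closed.
func run(events <-chan event, reconcile func(key string)) {
	for ev := range events {
		reconcile(keyFor(ev))
	}
}

func main() {
	events := make(chan event, 2)
	events <- event{"ADDED", "default", "my-first-etcd-cluster"}
	events <- event{"MODIFIED", "default", "my-first-etcd-cluster"}
	close(events)

	run(events, func(key string) { fmt.Println("reconcile", key) })
}
```

Passing only the key, not the event, is a design choice: the handler cannot depend on having seen every change, which keeps it correct after restarts or dropped notifications.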

API cache

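Caching reduces API server load, but a cache can be stale. The controller must tolerate outdated reads, including the “already exists” error it gets when it tries to create a resource that the cache has not yet seen.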
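One way to cope with a lagging cache is to treat “already exists” as success when creating a resource. The sketch below uses a hypothetical in‑memory fakeAPI and a sentinel errAlreadyExists; with the real client libraries, the equivalent check would be IsAlreadyExists from k8s.io/apimachinery/pkg/api/errors.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a hypothetical stand-in for the API server's
// "already exists" response.
var errAlreadyExists = errors.New("already exists")

// fakeAPI models the authoritative server state.
type fakeAPI struct{ objects map[string]bool }

func (a *fakeAPI) create(name string) error {
	if a.objects[name] {
		return errAlreadyExists
	}
	a.objects[name] = true
	return nil
}

// ensure creates name if the (possibly stale) cache says it is missing.
// An "already exists" error means the cache lagged behind the server,
// which is treated as success rather than as a failure.
func ensure(api *fakeAPI, cache map[string]bool, name string) error {
	if cache[name] {
		return nil
	}
	err := api.create(name)
	if errors.Is(err, errAlreadyExists) {
		return nil
	}
	return err
}

func main() {
	api := &fakeAPI{objects: map[string]bool{"my-first-etcd-cluster-0": true}}
	staleCache := map[string]bool{} // has not yet seen the object
	fmt.Println(ensure(api, staleCache, "my-first-etcd-cluster-0"))
}
```

This only works safely when the resource name is predictable, so a retry collides with the earlier object instead of creating a second one.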

Batch updates

When many resources change simultaneously, batching prevents the controller from performing redundant work.
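Batching can be pictured as a queue that coalesces repeated keys. The dedupQueue below is a hypothetical, single‑goroutine sketch (it has no locking); client‑go's workqueue package provides a concurrency‑safe queue with similar de‑duplication.

```go
package main

import "fmt"

// dedupQueue coalesces repeated keys: a key that is already waiting is not
// added a second time. Not safe for concurrent use.
type dedupQueue struct {
	pending map[string]bool
	order   []string
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: map[string]bool{}}
}

// Add enqueues key unless it is already pending.
func (q *dedupQueue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get pops the oldest pending key; ok is false when the queue is empty.
func (q *dedupQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

func main() {
	q := newDedupQueue()
	for _, k := range []string{"default/etcd-a", "default/etcd-a", "default/etcd-b", "default/etcd-a"} {
		q.Add(k)
	}
	for {
		key, ok := q.Get()
		if !ok {
			break
		}
		fmt.Println("reconcile", key)
	}
}
```

Four notifications arrive here, but only two reconciles run, one per distinct object.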

Permissions

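Operators need a ServiceAccount with the minimum RBAC permissions required to get, list, watch, create, update, and delete the resources they manage. With Kubebuilder, these permissions are declared as marker comments in the controller source, and the RBAC manifests are generated from them. Example markers: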

//+kubebuilder:rbac:groups=etcd.improbable.io,resources=etcdpeers,verbs=get;list;watch
//+kubebuilder:rbac:groups=etcd.improbable.io,resources=etcdpeers/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=apps,resources=replicasets,verbs=list;get;create;watch
//+kubebuilder:rbac:groups=core,resources=persistentvolumeclaims,verbs=list;get;create;watch;delete

Validation and defaults

CRDs provide basic schema validation. More complex checks and defaulting are usually implemented in validating and mutating admission webhooks, or in the Operator itself.
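As an illustration of defaulting and validation logic that a webhook and the controller could share, here is a sketch on a trimmed copy of EtcdClusterSpec. The Default and Validate method names, the default of 3 replicas, and the specific rules are assumptions made for this example, not the original Operator's behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// EtcdClusterSpec is a trimmed copy of the spec, kept to two fields so the
// example is self-contained.
type EtcdClusterSpec struct {
	Version  string
	Replicas *int32
}

// defaultReplicas is a hypothetical default chosen for this sketch.
const defaultReplicas int32 = 3

// Default fills in unset optional fields.
func (s *EtcdClusterSpec) Default() {
	if s.Replicas == nil {
		r := defaultReplicas
		s.Replicas = &r
	}
}

// Validate performs checks that go beyond basic schema validation.
func (s *EtcdClusterSpec) Validate() error {
	if s.Version == "" {
		return errors.New("spec.version must be set")
	}
	if s.Replicas != nil && *s.Replicas < 1 {
		return errors.New("spec.replicas must be at least 1")
	}
	return nil
}

func main() {
	spec := EtcdClusterSpec{Version: "3.2.28"}
	spec.Default()
	fmt.Println(*spec.Replicas, spec.Validate())
}
```

Calling the same Default method from both the webhook and the reconcile loop means the controller still sees sensible values if the webhook is not installed or is bypassed.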

Testing

Unit tests can cover individual logic units, while integration tests benefit from a real Kubernetes cluster. Tools like kind (Kubernetes in Docker) enable fast, realistic integration testing.

Conclusion

Deploy Operators as Pods in the cluster.

Go offers the most mature ecosystem, though any language can be used.

Handle non‑Kubernetes resources carefully, especially during network failures.

Perform a single action per reconciliation cycle and re‑queue as needed.

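Adopt a “condition‑based” (level‑triggered) approach rather than an “edge‑based” one: act on the state observed now, not on the event that triggered the reconcile.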

Use deterministic naming for generated resources.

Grant the ServiceAccount the least privileges required.

Apply defaults both in webhooks and in code.

Use kind for reliable integration testing.
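The deterministic naming recommendation can be made concrete with a small sketch. The `<cluster>-<index>` scheme and the function names peerName and expectedPeerNames are assumptions for illustration, not necessarily what the etcd Operator uses.

```go
package main

import "fmt"

// peerName derives a stable name from the cluster name and the peer index,
// so repeated reconciles always compute the same name for the same peer.
func peerName(cluster string, index int) string {
	return fmt.Sprintf("%s-%d", cluster, index)
}

// expectedPeerNames lists the names a cluster of the given size should have.
func expectedPeerNames(cluster string, replicas int) []string {
	var names []string
	for i := 0; i < replicas; i++ {
		names = append(names, peerName(cluster, i))
	}
	return names
}

func main() {
	for _, n := range expectedPeerNames("my-first-etcd-cluster", 3) {
		fmt.Println(n)
	}
}
```

With names like these, a create that is retried because of a stale cache fails with “already exists” instead of silently producing a duplicate, which is not the case with randomly generated names.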

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

MaGe Linux Operations
