Mastering Kubernetes Operators: Best Practices for Reliable Cloud‑Native Applications
This article translates and expands Red Hat's guide on Kubernetes Operators, explaining how operators watch resources, handle reconcile cycles, implement watches, validate and initialize custom resources, manage finalizers, ownership, status, and error handling, and provides practical code examples for building production‑ready operators.
Introduction
Kubernetes Operators are processes that connect to the main API server and watch a limited set of resource types. When a watched event occurs, the operator responds and may interact with the API as well as other systems, both inside and outside the cluster.
Operators are collections of controllers, each watching a specific resource type. When the watched resource triggers an event, a reconcile cycle starts.
During a reconcile cycle the controller checks whether the current state matches the desired state described by the watched resource. The design is level-based rather than edge-based: each cycle considers the entire state instead of reacting to an individual change, so the operator still converges when events are missed or arrive late in an unreliable environment.
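To make the level-based idea concrete, here is a minimal sketch in plain Go. It does not use any Kubernetes library; the `State` type and the `reconcile` function are hypothetical and exist only for illustration. The point is that the function looks only at desired versus current state, never at the event that triggered it.

```go
package main

import "fmt"

// State is a hypothetical, simplified view of a workload: just a replica count.
type State struct {
	Replicas int
}

// reconcile compares desired and current state and returns the actions
// needed to converge. It never inspects which event triggered the call.
func reconcile(desired, current State) []string {
	var actions []string
	for i := current.Replicas; i < desired.Replicas; i++ {
		actions = append(actions, "create replica")
	}
	for i := current.Replicas; i > desired.Replicas; i-- {
		actions = append(actions, "delete replica")
	}
	return actions
}

func main() {
	// Even if intermediate scale events were missed, one pass converges.
	fmt.Println(reconcile(State{Replicas: 3}, State{Replicas: 1}))
	// Running again on an already converged state yields no actions.
	fmt.Println(len(reconcile(State{Replicas: 3}, State{Replicas: 3})))
}
```

Because the function is a pure comparison of two states, it can be called any number of times and in response to any event without changing the outcome.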
The original guide illustrates the API request lifecycle with a high-level diagram (omitted here). A request to create or delete a resource passes through authentication, authorization, and admission control before it is persisted, and admission webhooks can mutate or validate the object along the way.
Creating Watches
Watches receive events for a specific resource type (core or CRD). To create a watch you typically specify:
The resource type to watch.
A handler that maps events to one or more reconcile instances.
A predicate that filters events of interest.
Example predicate filtering TLS‑type Secret events:
isAnnotatedSecret := predicate.Funcs{
	// Updates: react only to TLS Secrets whose certificate data or annotation changed.
	UpdateFunc: func(e event.UpdateEvent) bool {
		oldSecret, ok := e.ObjectOld.(*corev1.Secret)
		if !ok {
			return false
		}
		newSecret, ok := e.ObjectNew.(*corev1.Secret)
		if !ok {
			return false
		}
		if newSecret.Type != util.TLSSecret {
			return false
		}
		oldValue := e.MetaOld.GetAnnotations()[certInfoAnnotation]
		newValue := e.MetaNew.GetAnnotations()[certInfoAnnotation]
		// Named to avoid shadowing the builtin new.
		wasAnnotated := oldValue == "true"
		isAnnotated := newValue == "true"
		// Certificate or CA content changed: reconcile only if the Secret is annotated now.
		if !reflect.DeepEqual(newSecret.Data[util.Cert], oldSecret.Data[util.Cert]) ||
			!reflect.DeepEqual(newSecret.Data[util.CA], oldSecret.Data[util.CA]) {
			return isAnnotated
		}
		// Content unchanged: reconcile only if the annotation was toggled.
		return wasAnnotated != isAnnotated
	},
	// Creates: react only to annotated TLS Secrets.
	CreateFunc: func(e event.CreateEvent) bool {
		secret, ok := e.Object.(*corev1.Secret)
		if !ok {
			return false
		}
		if secret.Type != util.TLSSecret {
			return false
		}
		return e.Meta.GetAnnotations()[certInfoAnnotation] == "true"
	},
}
A common pattern is to watch owned resources using handler.EnqueueRequestForOwner:
// In practice, also set OwnerType (and usually IsController) on the handler.
err = c.Watch(&source.Kind{Type: &examplev1alpha1.MyControlledType{}}, &handler.EnqueueRequestForOwner{})
Another pattern propagates a Secret change to multiple Routes that reference it:
// enqueueRequestForReferecingRoutes maps a Secret event to one reconcile
// request per Route that references the Secret.
type enqueueRequestForReferecingRoutes struct {
	client.Client
}

func (e *enqueueRequestForReferecingRoutes) Create(evt event.CreateEvent, q workqueue.RateLimitingInterface) {
	// The lookup error is ignored here for brevity; production code should log it.
	routes, _ := matchSecret(e.Client, types.NamespacedName{Name: evt.Meta.GetName(), Namespace: evt.Meta.GetNamespace()})
	for _, route := range routes {
		q.Add(reconcile.Request{NamespacedName: types.NamespacedName{Namespace: route.GetNamespace(), Name: route.GetName()}})
	}
}

func (e *enqueueRequestForReferecingRoutes) Update(evt event.UpdateEvent, q workqueue.RateLimitingInterface) {
	routes, _ := matchSecret(e.Client, types.NamespacedName{Name: evt.MetaNew.GetName(), Namespace: evt.MetaNew.GetNamespace()})
	for _, route := range routes {
		q.Add(reconcile.Request{NamespacedName: types.NamespacedName{Namespace: route.GetNamespace(), Name: route.GetName()}})
	}
}

// Delete and Generic are also required by handler.EventHandler; they are omitted here.
Resource Reconciliation Cycle
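The reconcile cycle begins after a watch delivers an event and the framework hands control back to the operator. Following the level-based model, the reconcile request identifies the resource only by namespace and name; it does not say what kind of event occurred, so the logic must work from the resource's current state.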
A typical reconcile model includes:
Fetching the CR instance of interest.
Validating the instance.
Initializing missing fields.
Handling deletion via finalizers.
Executing business logic specific to the controller.
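The steps above can be sketched as a single function. This is a simplified model in plain Go, not the controller-runtime API: the `Resource` type, the in-memory `store` standing in for the API server, the finalizer name, and the string results are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Resource is a hypothetical stand-in for a custom resource.
type Resource struct {
	Size              int // spec field; 0 means "not set"
	Finalizers        []string
	DeletionRequested bool
}

// finalizer is a hypothetical finalizer name.
const finalizer = "example.com/cleanup"

// store stands in for the API server.
var store = map[string]*Resource{}

func validate(r *Resource) error {
	if r.Size < 0 {
		return errors.New("size must not be negative")
	}
	return nil
}

// initialize fills in defaults and reports whether it changed anything.
func initialize(r *Resource) bool {
	changed := false
	if r.Size == 0 {
		r.Size = 1
		changed = true
	}
	if !r.DeletionRequested && len(r.Finalizers) == 0 {
		r.Finalizers = []string{finalizer}
		changed = true
	}
	return changed
}

// reconcile walks the steps in order: fetch, validate, initialize,
// handle deletion, then business logic.
func reconcile(name string) (string, error) {
	r, found := store[name]
	if !found {
		return "gone", nil // already deleted; nothing to do
	}
	if err := validate(r); err != nil {
		return "invalid", err
	}
	if initialize(r) {
		return "initialized", nil // a real update would trigger another cycle
	}
	if r.DeletionRequested {
		r.Finalizers = nil // cleanup would run here, then the finalizer is dropped
		delete(store, name)
		return "finalized", nil
	}
	return "reconciled", nil // controller-specific business logic goes here
}

func main() {
	store["demo"] = &Resource{}
	for i := 0; i < 2; i++ {
		result, err := reconcile("demo")
		fmt.Println(result, err)
	}
}
```

Returning early after initialization mirrors the idea that each cycle does one thing and relies on the next event to continue.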
Resource Validation
Two validation types are recommended:
Syntax validation via OpenAPI rules.
Semantic validation via a validating admission webhook (registered with a ValidatingWebhookConfiguration).
Because a CR that reaches etcd cannot be rejected by the controller, it is advisable to also perform semantic checks inside the controller and share validation code between the webhook and the controller.
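One way to share validation code is to keep the semantic rules in a single function and call it from both paths. The sketch below is plain Go with hypothetical names (`CertSpec`, `ValidateSpec`, `admit`, `reconcileValidate`); it models the two call sites rather than showing a real webhook server.

```go
package main

import (
	"errors"
	"fmt"
)

// CertSpec is a hypothetical custom resource spec.
type CertSpec struct {
	SecretName string
	RenewDays  int
}

// ValidateSpec holds the semantic rules in one place.
func ValidateSpec(s CertSpec) error {
	if s.SecretName == "" {
		return errors.New("secretName must be set")
	}
	if s.RenewDays < 1 || s.RenewDays > 365 {
		return errors.New("renewDays must be between 1 and 365")
	}
	return nil
}

// admit models the webhook path: an invalid object is rejected
// before it is persisted.
func admit(s CertSpec) (allowed bool, message string) {
	if err := ValidateSpec(s); err != nil {
		return false, err.Error()
	}
	return true, ""
}

// reconcileValidate models the controller path: the object already
// exists, so the failure can only be reported, for example in status.
func reconcileValidate(s CertSpec) string {
	if err := ValidateSpec(s); err != nil {
		return "Failure: " + err.Error()
	}
	return "Success"
}

func main() {
	bad := CertSpec{SecretName: "tls", RenewDays: 0}
	fmt.Println(admit(bad))
	fmt.Println(reconcileValidate(bad))
}
```

With this layout the controller stays correct even when the webhook is not deployed or is bypassed.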
Syntax Validation
Add OpenAPI validation rules to the CRD definition.
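If the CRD is generated from Go types with kubebuilder's controller-gen, the rules can be written as marker comments on the spec fields. The sketch below assumes that workflow; the `FirstMateSpec` fields and their limits are hypothetical, while the marker names shown are standard kubebuilder validation markers.

```go
package main

import "fmt"

// FirstMateSpec is a hypothetical spec type. The marker comments are read
// by the CRD generator and turned into OpenAPI validation rules.
type FirstMateSpec struct {
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=10
	Replicas int32 `json:"replicas"`

	// +kubebuilder:validation:Enum=port;starboard
	Watch string `json:"watch"`

	// +kubebuilder:validation:MinLength=1
	Name string `json:"name"`
}

func main() {
	// The Go compiler ignores the markers; the API server enforces the
	// generated schema when the resource is submitted.
	spec := FirstMateSpec{Replicas: 3, Watch: "port", Name: "smee"}
	fmt.Printf("%+v\n", spec)
}
```

Rules expressed this way are checked by the API server itself, so no operator code runs for syntactically invalid objects.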
Semantic Validation
Implement custom logic in the operator and expose it through a webhook. In OpenShift 3.11 admission webhooks are a preview feature, and the Operator SDK does not scaffold them; use the kubebuilder webhook command instead:
kubebuilder webhook --group crew --version v1 --kind FirstMate --type=mutating --operations=create,update
Resource Initialization
Initialize all fields in the controller (or via a mutating admission webhook, registered with a MutatingWebhookConfiguration). Example:
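// If IsInitialized had to set any defaults, persist them and end this cycle;
// the update triggers a new reconcile on the initialized resource.
if ok := r.IsInitialized(instance); !ok {
	err := r.GetClient().Update(context.TODO(), instance)
	if err != nil {
		log.Error(err, "unable to update instance", "instance", instance)
		return r.ManageError(instance, err)
	}
	return reconcile.Result{}, nil
}
Resource Finalization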
Use finalizers to perform cleanup before the Kubernetes garbage collector removes a CR. The algorithm checks for the controller's finalizer, runs cleanup logic, removes the finalizer, and updates the resource.
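The article does not show the finalizer helpers it relies on. As a rough sketch of what such helpers might look like, the functions below operate on a plain slice of finalizer names rather than on a Kubernetes object; their names and signatures are assumptions and are not the `util` package's actual API.

```go
package main

import "fmt"

// HasFinalizer reports whether name is present in finalizers.
func HasFinalizer(finalizers []string, name string) bool {
	for _, f := range finalizers {
		if f == name {
			return true
		}
	}
	return false
}

// AddFinalizer returns finalizers with name appended if it was absent.
func AddFinalizer(finalizers []string, name string) []string {
	if HasFinalizer(finalizers, name) {
		return finalizers
	}
	return append(finalizers, name)
}

// RemoveFinalizer returns a new slice that omits name.
func RemoveFinalizer(finalizers []string, name string) []string {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	fs := AddFinalizer(nil, "example-controller")
	fmt.Println(HasFinalizer(fs, "example-controller"))
	fs = RemoveFinalizer(fs, "example-controller")
	fmt.Println(len(fs))
}
```

Adding the finalizer usually belongs in the initialization step, so that it is in place before any deletion request arrives.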
if util.IsBeingDeleted(instance) {
	// Nothing to do if this controller's finalizer is already gone.
	if !util.HasFinalizer(instance, controllerName) {
		return reconcile.Result{}, nil
	}
	// Run the cleanup logic before letting the deletion proceed.
	err := r.manageCleanUpLogic(instance)
	if err != nil {
		log.Error(err, "unable to delete instance", "instance", instance)
		return r.ManageError(instance, err)
	}
	// Remove the finalizer and persist the change so the resource can be deleted.
	util.RemoveFinalizer(instance, controllerName)
	err = r.GetClient().Update(context.TODO(), instance)
	if err != nil {
		log.Error(err, "unable to update instance", "instance", instance)
		return r.ManageError(instance, err)
	}
	return reconcile.Result{}, nil
}
Resource Ownership
Set controller references so that owned resources are deleted automatically when the owner is removed:
controllerutil.SetControllerReference(owner, obj, r.GetScheme())
Ownership has rules: a namespaced owner and its owned resources must be in the same namespace, and a namespaced resource cannot own a cluster-scoped resource.
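The effect of an owner reference can be modeled in a few lines. The sketch below is a toy model in plain Go, not the Kubernetes garbage collector: the `Object` type and `deleteCascade` function are hypothetical, and owners are identified by name instead of by UID.

```go
package main

import "fmt"

// Object is a hypothetical, simplified API object.
type Object struct {
	Name  string
	Owner string // name of the owning object; empty if unowned
}

// deleteCascade removes the named object and, transitively, every object
// that names it as owner.
func deleteCascade(objects map[string]Object, name string) {
	delete(objects, name)
	for _, o := range objects {
		if o.Owner == name {
			deleteCascade(objects, o.Name)
		}
	}
}

func main() {
	objects := map[string]Object{
		"my-cr":      {Name: "my-cr"},
		"deployment": {Name: "deployment", Owner: "my-cr"},
		"pod":        {Name: "pod", Owner: "deployment"},
		"unrelated":  {Name: "unrelated"},
	}
	deleteCascade(objects, "my-cr")
	fmt.Println(len(objects))
}
```

Because cleanup of owned resources follows from the reference, the controller does not need a finalizer just to delete objects it created inside the cluster.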
Status Management
Use the Status sub‑resource to report the result of each reconcile cycle without incrementing metadata.generation:
err = r.Status().Update(context.Background(), instance)
A predicate such as GenerationChangedPredicate filters out updates that do not change metadata.generation, so status-only updates do not trigger another reconcile. The variant below also lets finalizer changes through:
type resourceGenerationOrFinalizerChangedPredicate struct {
	predicate.Funcs
}

// Update ignores events where neither the generation nor the finalizers changed.
func (resourceGenerationOrFinalizerChangedPredicate) Update(e event.UpdateEvent) bool {
	if e.MetaNew.GetGeneration() == e.MetaOld.GetGeneration() && reflect.DeepEqual(e.MetaNew.GetFinalizers(), e.MetaOld.GetFinalizers()) {
		return false
	}
	return true
}
Error Management
When a reconcile returns an error, the operator logs it, records an event, updates the CR status with failure information, and schedules a retry with exponential back‑off (capped at six hours).
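The back-off rule can be looked at in isolation. The sketch below is an assumption-labeled simplification: `nextRetry` and `maxRetry` are hypothetical names, and the function only captures the doubling and the six-hour cap, not the status bookkeeping.

```go
package main

import (
	"fmt"
	"time"
)

// maxRetry is the upper bound on the retry interval.
const maxRetry = 6 * time.Hour

// nextRetry doubles the previous interval and caps the result at maxRetry.
// A non-positive previous interval is treated as one second.
func nextRetry(previous time.Duration) time.Duration {
	if previous <= 0 {
		previous = time.Second
	}
	if previous > maxRetry/2 {
		return maxRetry
	}
	return 2 * previous
}

func main() {
	d := time.Second
	for i := 0; i < 5; i++ {
		d = nextRetry(d)
		fmt.Println(d)
	}
}
```

Starting from one second, the interval grows geometrically and then stays at the cap, which keeps a persistently failing resource from flooding the API server with retries.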
func (r *ReconcilerBase) ManageError(obj metav1.Object, issue error) (reconcile.Result, error) {
	runtimeObj, ok := obj.(runtime.Object)
	if !ok {
		log.Error(errors.New("not a runtime.Object"), "passed object was not a runtime.Object", "object", obj)
		return reconcile.Result{}, nil
	}
	var retryInterval time.Duration
	// Record the failure as an event on the resource.
	r.GetRecorder().Event(runtimeObj, "Warning", "ProcessingError", issue.Error())
	if reconcileStatusAware, updateStatus := obj.(apis.ReconcileStatusAware); updateStatus {
		lastUpdate := reconcileStatusAware.GetReconcileStatus().LastUpdate.Time
		lastStatus := reconcileStatusAware.GetReconcileStatus().Status
		status := apis.ReconcileStatus{LastUpdate: metav1.Now(), Reason: issue.Error(), Status: "Failure"}
		reconcileStatusAware.SetReconcileStatus(status)
		err := r.GetClient().Status().Update(context.Background(), runtimeObj)
		if err != nil {
			log.Error(err, "unable to update status")
			return reconcile.Result{RequeueAfter: time.Second, Requeue: true}, nil
		}
		// First failure: start at one second. Otherwise use the time since the last failure.
		if lastUpdate.IsZero() || lastStatus == "Success" {
			retryInterval = time.Second
		} else {
			retryInterval = status.LastUpdate.Sub(lastUpdate).Round(time.Second)
		}
	} else {
		log.Info("object is not ReconcileStatusAware, not setting status")
		retryInterval = time.Second
	}
	// Double the interval, capped at six hours.
	return reconcile.Result{RequeueAfter: time.Duration(math.Min(float64(retryInterval.Nanoseconds()*2), float64(time.Hour.Nanoseconds()*6))), Requeue: true}, nil
}
Conclusion
The practices described address the most common challenges when building Kubernetes Operators and help you create production‑ready operators. For more comprehensive examples, refer to the operator-utils repository.
Original source: https://cloud.redhat.com/blog/kubernetes-operators-best-practices
MaGe Linux Operations