
Design and Implementation of a Zookeeper Operator for Kubernetes

This article outlines the design, functional requirements, CRD definition, architecture, deployment, scaling, monitoring, fault‑tolerance, and upgrade strategies of a Zookeeper operator on Kubernetes, including code examples, service configurations, and integration with Prometheus and OAM standards.

Manbang Technology Team

Introduction

At KubeCon 2018, Alibaba’s Chen Jun introduced the concept of a Node Operator, inspiring the development of a Zookeeper Operator to containerize NoSQL components and manage their lifecycle on Kubernetes.

Functional Requirements

The operator must provide rapid deployment, safe scaling, automated monitoring, self‑healing, and visualized operations.

CRD Definition

The first step is defining a declarative spec that covers node resources, monitoring components, replica count, and persistent storage.

Architecture

Deploy: Generates native resources such as StatefulSet, Service, ConfigMap, and PersistentVolume for fast Zookeeper cluster deployment.

Monitor: Creates ServiceMonitor and PrometheusRule resources to register the cluster with Prometheus and set alerting policies.

Scale: Controls scaling and rolling upgrades, keeping leader/follower switches (leader re-elections) to a minimum during restarts.
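As a rough illustration of how the modules map one custom resource to child resources, the following sketch lists what would be reconciled for a given CR. The function name, the reuse of the CR name for every child, and the rule that monitoring resources are skipped when the exporter is disabled are assumptions, not details confirmed by the article.

```python
from typing import Any


def desired_resources(cr: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (kind, name) pairs to reconcile for one ZooKeeper CR (hypothetical helper)."""
    name = cr["metadata"]["name"]
    cluster = cr["spec"]["cluster"]

    # Deploy module: native resources named in the article. Storage is assumed
    # to be provisioned through the StatefulSet's volume claim templates.
    resources = [
        ("ConfigMap", name),
        ("Service", name),
        ("StatefulSet", name),
    ]

    # Monitor module: assumed to apply only when the exporter flag is set.
    if cluster.get("exporter", {}).get("exporter", False):
        resources += [("ServiceMonitor", name), ("PrometheusRule", name)]
    return resources


sample_cr = {
    "metadata": {"name": "zookeeper-sample"},
    "spec": {"cluster": {"nodeCount": 3, "exporter": {"exporter": True}}},
}
kinds = [kind for kind, _ in desired_resources(sample_cr)]
```

The Scale module is not represented here because it acts on existing resources rather than creating new ones.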

CRD Example

```yaml
apiVersion: database.ymm-inc.com/v1beta1
kind: ZooKeeper
metadata:
  name: zookeeper-sample
spec:
  version: v3.5.6
  cluster:
    name: test
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 2Gi
    exporter:
      exporter: true
      exporterImage: harbor.ymmoa.com/monitoring/zookeeper_exporter
      exporterVersion: v3.5.6
    nodeCount: 3
    storage:
      size: 100Gi
```

Deployment Details

Labels applied to the StatefulSet and Service for selection and monitoring:

```yaml
labels:
  app: zookeeper
  app.kubernetes.io/instance: zookeeper-sample
  component: zookeeper
  zookeeper: zookeeper-sample
```

InitContainer: Copies the Zookeeper configuration file into the pod’s working directory.

Main Containers: The Zookeeper process, a monitoring sidecar (exporter), and an agent container for health checks.

Environment Variables: POD_IP, POD_NAME, ZK_SERVER_HEAP, and similar values are injected from the pod spec.

Readiness Probe: Uses the Zookeeper four-letter command ruok (a healthy server answers imok) to verify the node is ready before updating the dynamic configuration file.
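The readiness check can be pictured as a plain TCP exchange: send ruok to the client port and expect imok back. The sketch below is a minimal version of that exchange; the function name, default port, and timeout are assumptions, and the article does not say how the probe is actually wired up. Note that from Zookeeper 3.5.3 onward, four-letter commands other than srvr must be allowed through the 4lw.commands.whitelist setting, so ruok has to be whitelisted for this to work.

```python
import socket


def zk_is_ok(host: str, port: int = 2181, timeout: float = 2.0) -> bool:
    """Send `ruok` to a ZooKeeper server and report whether it answers `imok`."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"ruok")
            reply = b""
            # The server closes the connection after answering, which ends the loop.
            while chunk := conn.recv(16):
                reply += chunk
    except OSError:
        # Connection refused, reset, or timed out: treat the node as not ready.
        return False
    return reply == b"imok"
```

A probe script would exit non-zero when `zk_is_ok("127.0.0.1")` returns False, so that Kubernetes keeps the pod out of the Service endpoints until the server responds.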

Monitoring Integration

ServiceMonitor registers the exporter port http-metrics with Prometheus:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: zookeeper
    component: zookeeper
spec:
  endpoints:
  - interval: 30s
    port: http-metrics
```

PrometheusRule creates alerting policies, e.g., sending alerts to a DingTalk robot.

Scaling and Upgrade Strategy

Scaling is driven by changing spec.cluster.nodeCount in the Zookeeper CR, which triggers the operator to add or remove nodes through Zookeeper's dynamic reconfiguration API.
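To make the reconfiguration step concrete, here is a sketch that turns a change in node count into incremental reconfig commands in zkCli syntax. The server entry format is the one used by Zookeeper's dynamic configuration; everything else is an assumption for illustration: the helper names, the headless Service name, the ports, and the convention that server id equals pod ordinal plus one.

```python
def server_spec(cluster: str, service: str, namespace: str, ordinal: int) -> str:
    """Build one dynamic-config server entry for a StatefulSet pod."""
    # Stable per-pod DNS name provided by a StatefulSet with a headless Service.
    host = f"{cluster}-{ordinal}.{service}.{namespace}.svc.cluster.local"
    # Format: server.<id>=<host>:<quorum port>:<election port>:<role>;<client port>
    # Assumed convention: server id = ordinal + 1 (ids must be positive).
    return f"server.{ordinal + 1}={host}:2888:3888:participant;2181"


def reconfig_steps(cluster: str, service: str, namespace: str,
                   current: int, desired: int) -> list[str]:
    """List incremental reconfig commands, changing one member per step."""
    if desired > current:
        return [
            f"reconfig -add {server_spec(cluster, service, namespace, o)}"
            for o in range(current, desired)
        ]
    # Scale down from the highest ordinal, mirroring StatefulSet behavior.
    return [f"reconfig -remove {o + 1}" for o in range(current - 1, desired - 1, -1)]


steps = reconfig_steps("zk", "zk-headless", "default", current=3, desired=5)
```

Changing one member per step keeps each intermediate ensemble small enough to reason about; whether the operator batches changes is not stated in the article.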

Rolling upgrades are performed by updating the StatefulSet with an OnDelete strategy; the operator deletes pods in a controlled order, respecting MaxUnavailable and leader election.
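One way to picture the controlled deletion order is a planner that restarts followers first, in batches no larger than MaxUnavailable, and leaves the leader for last so that at most one re-election is forced. This is a sketch under that assumption; the function names and the exact ordering policy are hypothetical and not taken from the operator's source.

```python
def ordinal(pod_name: str) -> int:
    """Extract the StatefulSet ordinal from a pod name such as 'zk-2'."""
    return int(pod_name.rsplit("-", 1)[1])


def plan_restart_batches(modes: dict[str, str], max_unavailable: int = 1) -> list[list[str]]:
    """Group pods into deletion batches: followers first, leader last."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    followers = sorted(
        (pod for pod, mode in modes.items() if mode != "leader"),
        key=ordinal,
        reverse=True,  # highest ordinal first, like a StatefulSet rolling update
    )
    leaders = [pod for pod, mode in modes.items() if mode == "leader"]
    batches = [
        followers[i:i + max_unavailable]
        for i in range(0, len(followers), max_unavailable)
    ]
    # The leader always gets its own final batch.
    batches.extend([pod] for pod in leaders)
    return batches


plan = plan_restart_batches({"zk-0": "follower", "zk-1": "leader", "zk-2": "follower"})
```

With an OnDelete strategy the operator would delete one batch, wait for the replacement pods to become ready, and only then move on. For a three-node ensemble, keeping `max_unavailable` at 1 preserves quorum throughout.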

Partitioned rolling updates allow selective pod replacement based on an index, ensuring minimal disruption.
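The index-based selection can be sketched as follows, assuming the operator follows the same convention as the native StatefulSet partition field, where only pods with an ordinal greater than or equal to the partition are replaced, highest ordinal first. The function name is hypothetical.

```python
def pods_to_update(name: str, replicas: int, partition: int) -> list[str]:
    """Return the pods a partitioned update would replace, in update order."""
    if partition < 0:
        raise ValueError("partition must not be negative")
    # Pods below the partition keep their current revision.
    return [f"{name}-{o}" for o in range(replicas - 1, partition - 1, -1)]


canary = pods_to_update("zk", replicas=5, partition=3)
```

Lowering the partition step by step (for example 4, then 2, then 0) widens the update gradually, which gives a simple canary-style rollout.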

Agent Sidecar API

/status: returns Zookeeper node metrics (sent/received, latency, mode, version, etc.).

/runok: checks if the node is running without errors.

/health: health check for the agent itself.

/get: retrieves the current dynamic configuration.

/add and /del: add or remove cluster members via Zookeeper reconfigure.
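The fields listed for /status match what Zookeeper's srvr four-letter command prints, so one plausible implementation is to run that command and parse its output. The sketch below shows such a parser; it is an assumption about how the agent could work, and both the function name and the sample text are illustrative rather than captured from a real server.

```python
def parse_srvr(text: str) -> dict:
    """Parse 'Key: value' lines from srvr-style output into a dictionary."""
    status: dict = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # skip lines that are not key/value pairs
        status[key.strip().lower()] = value.strip()

    # Split the "min/avg/max" latency triple into separate numbers.
    latency = status.pop("latency min/avg/max", None)
    if latency is not None:
        low, avg, high = (float(part) for part in latency.split("/"))
        status["latency"] = {"min": low, "avg": avg, "max": high}
    return status


# Illustrative srvr-style output, not captured from a real server.
SAMPLE = """Zookeeper version: 3.5.6, built on 10/08/2019 20:18 GMT
Latency min/avg/max: 0/1/7
Received: 42
Sent: 41
Mode: follower
"""

status = parse_srvr(SAMPLE)
```

The mode field is what a rolling-upgrade routine would consult to tell the leader apart from followers.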

OAM Integration

The operator aligns with the Open Application Model (OAM) by defining reusable Components (e.g., the Zookeeper workload) and Traits (e.g., scaling and rolling‑update CRDs), enabling platform‑agnostic application description and management.

Conclusion

The Zookeeper operator demonstrates a cloud‑native approach to managing stateful services on Kubernetes, providing deployment, scaling, monitoring, fault‑tolerance, and upgrade capabilities, while offering extensibility for future features such as backup, migration, and advanced scheduling.
