Design and Implementation of a ZooKeeper Operator for Kubernetes
This article covers the design of a ZooKeeper operator on Kubernetes: its functional requirements, CRD definition, architecture, deployment, scaling, monitoring, fault tolerance, and upgrade strategy. It includes code examples, service configurations, and integration with Prometheus and the OAM standard.
Introduction
At KubeCon 2018, Alibaba’s Chen Jun introduced the concept of a Node Operator. That idea inspired the development of a ZooKeeper Operator that containerizes NoSQL components and manages their lifecycle on Kubernetes.
Functional Requirements
The operator must provide rapid deployment, safe scaling, automated monitoring, self-healing, and visual operation capabilities.
CRD Definition
The first step is to define a declarative custom resource spec that covers node resources, monitoring components, replica count, and persistent storage.
Architecture
Deploy: Generates native resources such as StatefulSet, Service, ConfigMap, and PersistentVolume so that a ZooKeeper cluster can be deployed quickly.
Monitor: Creates ServiceMonitor and PrometheusRule resources to register the cluster with Prometheus and define alerting policies.
Scale: Controls scaling and rolling upgrades, keeping leader/follower switches during restarts to a minimum.
CRD Example
```yaml
apiVersion: database.ymm-inc.com/v1beta1
kind: ZooKeeper
metadata:
  name: zookeeper-sample
spec:
  version: v3.5.6
  cluster:
    name: test
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 2Gi
    exporter:
      exporter: true
      exporterImage: harbor.ymmoa.com/monitoring/zookeeper_exporter
      exporterVersion: v3.5.6
    nodeCount: 3
    storage:
      size: 100Gi
```
Deployment Details
Labels applied to the StatefulSet and Service for selection and monitoring:
```yaml
labels:
  app: zookeeper
  app.kubernetes.io/instance: zookeeper-sample
  component: zookeeper
  zookeeper: zookeeper-sample
```
InitContainer: Copies the ZooKeeper configuration file into the pod’s working directory.
Main Containers: The ZooKeeper process, a monitoring sidecar (exporter), and an agent container for health checks.
Environment Variables: POD_IP, POD_NAME, ZK_SERVER_HEAP, and similar values are injected from the pod spec.
Readiness Probe: Uses the ruok command to verify that the node is ready before the dynamic configuration file is updated.
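The readiness check can be sketched as follows. This is a minimal illustration under stated assumptions, not the operator's actual probe: it sends ZooKeeper's four-letter command `ruok` to the client port and treats the reply `imok` as ready. The function names, the default port 2181, and the timeout are assumptions made for the example. On ZooKeeper 3.5.3 and later, `ruok` also has to be allowed through `4lw.commands.whitelist`.

```python
import socket


def reply_is_ok(reply: bytes) -> bool:
    """A node is considered ready only if it answered `ruok` with `imok`."""
    return reply.strip() == b"imok"


def four_letter_word(host: str, port: int, command: bytes, timeout: float = 2.0) -> bytes:
    """Send a ZooKeeper four-letter command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks)


def is_ready(host: str = "127.0.0.1", port: int = 2181) -> bool:
    """Return True when the node at host:port answers `ruok` correctly."""
    try:
        return reply_is_ok(four_letter_word(host, port, b"ruok"))
    except OSError:
        # Connection refused, reset, or timed out: treat the node as not ready.
        return False
```

A probe script built on this sketch would exit with status 0 when `is_ready()` returns True and with a non-zero status otherwise.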
Monitoring Integration
ServiceMonitor registers the exporter port http-metrics with Prometheus:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: zookeeper
    component: zookeeper
spec:
  endpoints:
    - interval: 30s
      port: http-metrics
```
PrometheusRule defines the alerting rules; the resulting alerts can then be delivered, for example, to a DingTalk robot.
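As a rough illustration of such an alerting policy, the sketch below builds a PrometheusRule manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. The function name, the rule and alert names, the `zk_up` metric, the `zookeeper` metric label, and the `pod` label in the annotation are all hypothetical. Substitute whatever the exporter in use actually exposes. Delivery to DingTalk is typically configured in Alertmanager and is outside this sketch.

```python
import json


def build_prometheus_rule(cluster: str, namespace: str = "default") -> dict:
    """Build a PrometheusRule manifest with a single 'node down' alert."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": f"{cluster}-rules",  # hypothetical naming scheme
            "namespace": namespace,
            "labels": {"app": "zookeeper", "component": "zookeeper"},
        },
        "spec": {
            "groups": [
                {
                    "name": f"{cluster}.rules",
                    "rules": [
                        {
                            "alert": "ZooKeeperNodeDown",
                            # Hypothetical metric and label names.
                            "expr": f'zk_up{{zookeeper="{cluster}"}} == 0',
                            "for": "1m",
                            "labels": {"severity": "critical"},
                            "annotations": {
                                "summary": "ZooKeeper node {{ $labels.pod }} is down"
                            },
                        }
                    ],
                }
            ]
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_prometheus_rule("zookeeper-sample"), indent=2))
```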
Scaling and Upgrade Strategy
To scale the cluster, update spec.cluster.nodeCount in the ZooKeeper CR; the operator then adds or removes nodes through ZooKeeper's dynamic reconfiguration API.
Rolling upgrades use a StatefulSet with the OnDelete update strategy: once the pod template has changed, the operator itself deletes pods in a controlled order, respecting MaxUnavailable and leader election.
Partitioned rolling updates replace only a subset of pods, selected by ordinal index, which keeps disruption to a minimum.
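The ordering logic described above might look like the sketch below. It is an assumption about how such a plan could be computed, not the operator's actual code. It borrows the StatefulSet convention that only pods with an ordinal greater than or equal to the partition are touched, restarts followers first in batches of at most `max_unavailable`, and leaves the leader for last so that leadership changes hands as late and as rarely as possible. The function name and the `roles` mapping are hypothetical.

```python
from typing import Dict, List


def plan_restart_batches(
    roles: Dict[int, str],
    partition: int = 0,
    max_unavailable: int = 1,
) -> List[List[int]]:
    """Group pod ordinals into ordered deletion batches.

    roles maps a pod ordinal to its ZooKeeper mode ("leader" or "follower").
    Only ordinals >= partition are included, highest ordinal first, and the
    leader is always placed alone in the final batch.
    """
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")

    eligible = sorted((o for o in roles if o >= partition), reverse=True)
    followers = [o for o in eligible if roles[o] != "leader"]
    leaders = [o for o in eligible if roles[o] == "leader"]

    # Followers go first, at most max_unavailable pods per batch.
    batches = [
        followers[i:i + max_unavailable]
        for i in range(0, len(followers), max_unavailable)
    ]
    # The leader is restarted last, on its own.
    batches.extend([o] for o in leaders)
    return batches
```

For a three-node ensemble whose leader is pod 1, `plan_restart_batches({0: "follower", 1: "leader", 2: "follower"})` yields `[[2], [0], [1]]`. A real controller would also wait for each batch to become ready again and would cap `max_unavailable` so that a quorum always remains.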
Agent Sidecar API
/status: Returns ZooKeeper node metrics (sent/received, latency, mode, version, etc.).
/runok: Checks whether the node is running without errors.
/health: Health check for the agent itself.
/get: Retrieves the current dynamic configuration.
/add and /del: Add or remove cluster members via ZooKeeper reconfigure.
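To make the membership change behind /add and /del more concrete, the sketch below computes what a scale-up or scale-down would pass to ZooKeeper's `reconfig -add` and `reconfig -remove`. It is a hedged illustration, not the agent's implementation. It assumes StatefulSet pod names of the form `<cluster>-<ordinal>`, a headless Service whose name is supplied by the caller, the default `cluster.local` domain, the conventional ports 2888, 3888, and 2181, and a server id equal to the pod ordinal plus one. All function names are hypothetical.

```python
from typing import List, Tuple


def server_spec(cluster: str, ordinal: int, service: str, namespace: str = "default") -> str:
    """Render one dynamic-configuration entry for a StatefulSet pod."""
    server_id = ordinal + 1  # assumption: myid = pod ordinal + 1
    host = f"{cluster}-{ordinal}.{service}.{namespace}.svc.cluster.local"
    # Format: server.<id>=<host>:<quorum port>:<election port>:<role>;<client port>
    return f"server.{server_id}={host}:2888:3888:participant;2181"


def plan_reconfig(
    current: int, desired: int, cluster: str, service: str
) -> Tuple[List[str], List[int]]:
    """Return (entries to add, server ids to remove) for a replica change."""
    if current < 0 or desired < 1:
        raise ValueError("current must be >= 0 and desired must be >= 1")
    joining = [server_spec(cluster, o, service) for o in range(current, desired)]
    # Remove the highest ordinals first, mirroring StatefulSet scale-down order.
    leaving = [o + 1 for o in range(current - 1, desired - 1, -1)]
    return joining, leaving
```

Scaling from 3 to 5 nodes produces two entries to add and nothing to remove; scaling from 5 back to 3 produces the server ids `[5, 4]` to remove. Members would normally be added or removed one at a time, with the new pod ready before it joins as a participant.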
OAM Integration
The operator aligns with the Open Application Model (OAM) by defining reusable Components (e.g., the ZooKeeper workload) and Traits (e.g., scaling and rolling-update CRDs), enabling platform-agnostic application description and management.
Conclusion
The ZooKeeper operator demonstrates a cloud-native approach to managing stateful services on Kubernetes. It provides deployment, scaling, monitoring, fault tolerance, and upgrade capabilities, and leaves room for future features such as backup, migration, and advanced scheduling.