Running Kafka on Kubernetes: Practical Tips, Pitfalls, and Best Practices
This guide explains how to run Kafka on Kubernetes, covering runtime resource needs, storage considerations, network requirements, configuration with Pods, StatefulSets, Helm charts and Operators, performance testing, monitoring, logging, health checks, rolling updates, scaling, and backup strategies.
Why Kafka on Kubernetes Is Challenging
Kubernetes is optimized for stateless, twelve‑factor workloads, whereas Apache Kafka is a stateful distributed log that behaves like a database. Deploying Kafka on Kubernetes therefore requires careful handling of persistent state, network latency, and resource allocation.
Runtime Resource Considerations
CPU : Kafka brokers are CPU‑efficient. TLS adds overhead only for encrypted client connections; broker CPU usage remains modest.
Memory : Brokers typically run a JVM with a 4‑5 GB heap, but Kafka relies heavily on the OS page cache. Allocate sufficient system memory beyond the heap to accommodate the cache.
Storage : Container‑local storage (e.g., emptyDir) is transient: its data is lost when the pod is deleted or rescheduled. Use non‑local persistent block volumes formatted with XFS or ext4. NFS v3/v4 is unsupported because it can trigger “silly rename” failures that corrupt broker directories.
Network : Low latency and high bandwidth are critical. Avoid placing all brokers on a single node, and do not span a Kafka cluster across data‑center boundaries. Instead, spread brokers across separate availability zones within the same Kubernetes cluster.
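The memory and placement advice above can be expressed in a broker pod template. The following is a minimal sketch, not a sizing recommendation: the app: kafka label, the placeholder image, the CPU request, and the 12Gi figure are all assumptions chosen for illustration. The only values taken from the text are the 4 GB heap and the idea of leaving memory beyond the heap for the page cache.

```shell
#!/bin/sh
# Sketch only: writes a pod-template fragment for a broker StatefulSet.
# Sizes, label, and image are illustrative assumptions, not tested recommendations.
set -eu

cat > broker-pod-fragment.yaml <<'EOF'
# Merge under spec.template.spec of the broker StatefulSet.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: kafka                        # hypothetical pod label
        topologyKey: kubernetes.io/hostname   # at most one broker per node
containers:
  - name: kafka
    image: example.invalid/kafka:latest       # placeholder image
    env:
      - name: KAFKA_HEAP_OPTS                 # honored only if the image uses Kafka's start scripts
        value: "-Xms4g -Xmx4g"
    resources:
      requests:
        cpu: "2"                              # assumed value
        memory: 12Gi                          # 4 GiB heap plus headroom for the OS page cache
      limits:
        memory: 12Gi
EOF

echo "wrote broker-pod-fragment.yaml"
```

The required anti‑affinity rule keeps two brokers off the same node. Spreading across availability zones would use the same pattern with a zone‑level topology key.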
Configuration Essentials
Pods : Deploy each Kafka broker and each ZooKeeper server in its own pod.
StatefulSet : Use a StatefulSet to guarantee ordered, stable pod identities and stable storage bindings.
Headless Service : Create a headless Service (clusterIP: None) so that each broker receives a stable DNS name (e.g., broker-0.kafka.default.svc.cluster.local) without load‑balancing.
Persistent Volumes : Bind each broker pod to a PersistentVolumeClaim that references a non‑local block storage class.
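The four building blocks above fit together roughly as follows. This is a hedged sketch rather than a production manifest: the image, the fast-block storage class, the mount path, the volume size, and the TCP readiness probe are assumptions, and broker configuration is omitted entirely. The names broker and kafka are chosen so that pod DNS names match the broker-0.kafka.default.svc.cluster.local example.

```shell
#!/bin/sh
# Sketch only: headless Service plus StatefulSet with per-broker volume claims.
set -eu

cat > kafka-statefulset.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: kafka
spec:
  clusterIP: None              # headless: per-pod DNS records, no load balancing
  selector:
    app: kafka
  ports:
    - name: broker
      port: 9092
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: broker                 # pods become broker-0, broker-1, ...
spec:
  serviceName: kafka           # ties pod DNS names to the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: example.invalid/kafka:latest   # placeholder image
          ports:
            - containerPort: 9092
          readinessProbe:                       # assumed minimal probe
            tcpSocket:
              port: 9092
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data    # assumed data directory
  volumeClaimTemplates:        # one PersistentVolumeClaim per broker pod
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-block            # hypothetical block storage class
        resources:
          requests:
            storage: 100Gi
EOF

echo "wrote kafka-statefulset.yaml"
# To apply against a real cluster: kubectl apply -f kafka-statefulset.yaml
```

Because each claim comes from volumeClaimTemplates, a rescheduled broker-1 reattaches to the same volume it used before.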
Example manifest repository: https://github.com/Yolean/kubernetes-kafka
Helm charts (official, Confluent, Bitnami) simplify parameterization of the above resources.
Operators such as Strimzi (https://strimzi.io/) can provision a full Kafka cluster, configure TLS between brokers, and manage topics via custom resources.
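With an operator, the cluster and its topics are declared as custom resources instead of hand‑written StatefulSets. The sketch below assumes Strimzi's v1beta2 API with a ZooKeeper‑based cluster. The field layout depends on the operator version (newer releases use a different, ZooKeeper‑free layout), so check it against the Strimzi documentation for your release. The cluster name, sizes, and partition count are illustrative assumptions.

```shell
#!/bin/sh
# Sketch only: Strimzi Kafka and KafkaTopic custom resources (v1beta2, ZooKeeper-based).
set -eu

cat > strimzi-cluster.yaml <<'EOF'
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster             # hypothetical cluster name
spec:
  kafka:
    replicas: 3
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true              # TLS-encrypted listener
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      deleteClaim: false
  entityOperator:
    topicOperator: {}          # enables topic management via KafkaTopic
    userOperator: {}
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: test
  labels:
    strimzi.io/cluster: my-cluster   # binds the topic to the cluster above
spec:
  partitions: 6
  replicas: 3
EOF

echo "wrote strimzi-cluster.yaml"
# To apply once the operator is installed: kubectl apply -f strimzi-cluster.yaml
```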
Performance Testing
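Benchmark the cluster with the built‑in tools. Both need a broker address; localhost:9092 below is a placeholder for your own bootstrap address (older releases of the consumer tool take --broker-list instead of --bootstrap-server):
bin/kafka-producer-perf-test.sh --topic test --num-records 1000000 --record-size 100 --throughput -1 --producer-props bootstrap.servers=localhost:9092
bin/kafka-consumer-perf-test.sh --bootstrap-server localhost:9092 --topic test --messages 1000000 --threads 1
Reference benchmark results are available from Jay Kreps and Stéphane Maarek.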
Operations and Maintenance
Monitoring : Deploy Prometheus with the JMX exporter to scrape Kafka, ZooKeeper, and Kafka Connect metrics. Add cAdvisor for container‑level resource metrics. Strimzi provides a ready‑made Grafana dashboard.
Logging : Configure containers to write logs to stdout / stderr and forward them to a central log store (e.g., Elasticsearch).
Health Checks : Define liveness and readiness probes for each broker pod so Kubernetes can restart unhealthy pods and exclude unready pods from service endpoints.
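Rolling Updates : Use the StatefulSet rolling‑update strategy; pods are updated one at a time, which allows zero‑downtime upgrades as long as topics are replicated across brokers and each broker is ready again before the next one restarts.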
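Scaling : Adjust the replica count in the StatefulSet to add or remove brokers. Kafka does not move existing partitions on its own, so after adding brokers, reassign partitions onto them manually (e.g., using the kafka-reassign-partitions.sh tool). Before removing brokers, move their partitions to the remaining brokers first.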
Topic Management : Prefer the Strimzi KafkaTopic custom resource for creating, updating, and deleting topics instead of ad‑hoc shell scripts.
Backup & Restore : Kafka’s durability depends on the underlying Kubernetes cluster. Use MirrorMaker for cross‑cluster replication or S3‑based snapshots (see Zalando backup guide) to protect against cluster loss.
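The manual rebalancing mentioned under Scaling usually follows a generate, execute, verify sequence. The script below is a sketch under assumptions: the topic name test, the broker IDs 0 to 3, and the localhost:9092 address are placeholders, and older Kafka releases take --zookeeper instead of --bootstrap-server. It writes the input file and prints the commands rather than running them, since they need a live cluster.

```shell
#!/bin/sh
# Sketch only: prepares a partition reassignment after adding a fourth broker.
set -eu

BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"      # placeholder bootstrap address
TARGET_BROKERS="${TARGET_BROKERS:-0,1,2,3}"   # assumed broker IDs after scaling

# Topics whose partitions should be spread over the target brokers.
cat > topics-to-move.json <<'EOF'
{"version": 1, "topics": [{"topic": "test"}]}
EOF

# Write the steps to a file instead of executing them.
cat > reassign-steps.txt <<EOF
# 1. Ask Kafka to propose a new assignment:
bin/kafka-reassign-partitions.sh --bootstrap-server $BOOTSTRAP --topics-to-move-json-file topics-to-move.json --broker-list "$TARGET_BROKERS" --generate
# 2. Save the proposed assignment printed by step 1 as reassignment.json.
# 3. Start the reassignment:
bin/kafka-reassign-partitions.sh --bootstrap-server $BOOTSTRAP --reassignment-json-file reassignment.json --execute
# 4. Check progress until every partition reports completion:
bin/kafka-reassign-partitions.sh --bootstrap-server $BOOTSTRAP --reassignment-json-file reassignment.json --verify
EOF

cat reassign-steps.txt
```

When scaling down, the same sequence applies with TARGET_BROKERS set to the brokers that will remain, run before the replica count is reduced.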
Conclusion
For small‑ to medium‑size deployments, Kubernetes provides flexibility and operational simplicity for Kafka. Extremely low‑latency or ultra‑high‑throughput workloads may still benefit from dedicated bare‑metal or VM‑based deployments.
References
Kafka NFS issues: https://engineering.skybettingandgaming.com/2018/07/10/kafka-nfs/
Kubernetes ZooKeeper tutorial: https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/
Yolean Kafka manifest: https://github.com/Yolean/kubernetes-kafka
Helm Kafka chart: https://github.com/helm/charts/tree/master/incubator/kafka
Strimzi Operator: https://strimzi.io/
LinkedIn Kafka benchmarking: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
Review of Amazon MSK: https://medium.com/@stephane.maarek/an-honest-review-of-aws-managed-apache-kafka-amazon-msk-94b1ff9459d8
Kafka‑Monitor: https://github.com/linkedin/kafka-monitor
Zalando backup guide: https://jobs.zalando.com/tech/blog/backing-up-kafka-zookeeper/
dbaplus Community
