Running Kafka on Kubernetes: Practical Tips, Pitfalls, and Best Practices

This guide explains how to run Kafka on Kubernetes, covering runtime resource needs, storage considerations, network requirements, configuration with Pods, StatefulSets, Helm charts and Operators, performance testing, monitoring, logging, health checks, rolling updates, scaling, and backup strategies.

dbaplus Community

Sep 4, 2019

Why Kafka on Kubernetes Is Challenging

Kubernetes is optimized for stateless, twelve‑factor workloads, whereas Apache Kafka is a stateful distributed log that behaves like a database. Deploying Kafka on Kubernetes therefore requires careful handling of persistent state, network latency, and resource allocation.

Runtime Resource Considerations

CPU : Kafka brokers are CPU‑efficient. TLS adds some overhead on the broker, partly because encrypted connections cannot use Kafka's zero‑copy transfer path, but broker CPU usage usually remains modest.

Memory : Brokers typically run a JVM with a 4‑5 GB heap, but Kafka relies heavily on the OS page cache. Allocate sufficient system memory beyond the heap to accommodate the cache.
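
As a rough illustration of the memory guidance above, the sketch below derives a container memory request from the heap size plus an allowance for the page cache. The 8 GiB page-cache figure is an assumption chosen for the example, not a recommendation from this guide, and JVM off-heap overhead is ignored.

```shell
# Hypothetical sizing inputs; tune both for your workload.
HEAP_GIB=5          # JVM heap, per the 4-5 GB guidance above
PAGE_CACHE_GIB=8    # assumed headroom for the OS page cache

# Kafka's start scripts read KAFKA_HEAP_OPTS to size the JVM heap.
KAFKA_HEAP_OPTS="-Xms${HEAP_GIB}g -Xmx${HEAP_GIB}g"

# The container's memory request should cover heap plus page cache.
MEMORY_REQUEST="$((HEAP_GIB + PAGE_CACHE_GIB))Gi"

echo "KAFKA_HEAP_OPTS=${KAFKA_HEAP_OPTS}"
echo "resources.requests.memory=${MEMORY_REQUEST}"
```

The second value would go into the broker container's resources.requests.memory field; whether to set an equal limit is a separate decision that depends on how the nodes are shared.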

Storage : Container‑local storage (e.g., emptyDir) is transient: its data is lost when the pod is deleted or rescheduled. Use non‑local persistent block volumes formatted with XFS or ext4. NFS v3/v4 is unsupported because of the NFS “silly rename” behavior, which causes failures when a broker renames or deletes its log directories.

Network : Low latency and high bandwidth are critical. Avoid placing all brokers on a single node and do not span a Kafka cluster across data‑center boundaries. Prefer separate availability zones within the same Kubernetes cluster.

Configuration Essentials

Pods : Deploy each Kafka broker and each ZooKeeper server in its own pod.

StatefulSet : Use a StatefulSet to guarantee ordered, stable pod identities and stable storage bindings.

Headless Service : Create a headless Service (clusterIP: None) so that each broker receives a stable DNS name (e.g., broker-0.kafka.default.svc.cluster.local) without load‑balancing.

Persistent Volumes : Bind each broker pod to a PersistentVolumeClaim that references a non‑local block storage class.
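
The four pieces above can be sketched as one manifest. This is a minimal skeleton under stated assumptions, not a working Kafka deployment: the names kafka and broker are chosen to match the DNS example above, while the image, port 9092, the block-ssd storage class, and the volume size are placeholders. Broker configuration such as broker IDs and advertised listeners is omitted. The anti-affinity rule is one way to avoid placing all brokers on a single node.

```shell
# Write a headless Service and a StatefulSet to a local file.
# Image, storage class, and sizes are hypothetical placeholders.
cat > kafka-statefulset.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: kafka
spec:
  clusterIP: None          # headless: DNS resolves to the pods directly
  selector:
    app: kafka
  ports:
    - name: broker
      port: 9092
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: broker
spec:
  serviceName: kafka       # gives broker-0.kafka.<namespace>.svc.cluster.local
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      affinity:
        podAntiAffinity:   # schedule each broker on a different node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: kafka
              topologyKey: kubernetes.io/hostname
      containers:
        - name: kafka
          image: registry.example.com/kafka:CHANGE_ME   # placeholder image
          ports:
            - containerPort: 9092
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:    # one PersistentVolumeClaim per broker pod
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: block-ssd   # placeholder storage class
        resources:
          requests:
            storage: 100Gi
EOF
```

Once the placeholders are filled in, the file can be applied with kubectl apply -f kafka-statefulset.yaml.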

Example manifest repository: https://github.com/Yolean/kubernetes-kafka

Helm charts (official, Confluent, Bitnami) simplify parameterization of the above resources.

Operators such as Strimzi (https://strimzi.io/) can provision a full Kafka cluster, configure TLS between brokers, and manage topics via custom resources.

Performance Testing

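Benchmark the cluster with the built‑in tools. Both need a broker address; the commands below assume localhost:9092, so substitute your own bootstrap address. Older Kafka releases expect --broker-list instead of --bootstrap-server for the consumer tool:

bin/kafka-producer-perf-test.sh --topic test --num-records 1000000 --record-size 100 --throughput -1 --producer-props bootstrap.servers=localhost:9092
bin/kafka-consumer-perf-test.sh --bootstrap-server localhost:9092 --topic test --messages 1000000 --threads 1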

Reference benchmark results are available from Jay Kreps (the LinkedIn benchmarking post) and Stéphane Maarek (the Amazon MSK review); both are listed under References.

Operations and Maintenance

Monitoring : Deploy Prometheus with the JMX exporter to scrape Kafka, ZooKeeper, and Kafka Connect metrics. Add cAdvisor for container‑level resource metrics. Strimzi provides a ready‑made Grafana dashboard.
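
One common way to attach the JMX exporter is as a Java agent on the broker JVM. The sketch below writes a catch-all exporter config and builds the agent flag; the jar path and port 7071 are assumptions for illustration, and a real setup would normally replace the catch-all rule with a curated rule set.

```shell
# Minimal JMX exporter config: export every MBean with default naming.
cat > jmx-exporter.yaml <<'EOF'
lowercaseOutputName: true
rules:
  - pattern: ".*"
EOF

# Hypothetical jar location and metrics port.
JMX_AGENT_JAR=/opt/jmx/jmx_prometheus_javaagent.jar
JMX_PORT=7071

# Kafka's start scripts pass KAFKA_OPTS to the broker JVM.
# Agent syntax: -javaagent:<jar>=<port>:<config file>
export KAFKA_OPTS="-javaagent:${JMX_AGENT_JAR}=${JMX_PORT}:$(pwd)/jmx-exporter.yaml"
echo "$KAFKA_OPTS"
```

Prometheus can then scrape each broker pod on the chosen port.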

Logging : Configure containers to write logs to stdout / stderr and forward them to a central log store (e.g., Elasticsearch).

Health Checks : Define liveness and readiness probes for each broker pod so Kubernetes can restart unhealthy pods and exclude unready pods from service endpoints.
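
A probe can be as simple as checking that the broker's listener accepts connections. The function below is a minimal sketch that could serve as an exec probe command, assuming the broker listens on port 9092 and that bash and the coreutils timeout command exist in the container image. A TCP check only shows that the listener is up; it does not show that the broker has caught up on replication.

```shell
# Succeeds if something accepts TCP connections on the given host and port.
# Relies on bash's /dev/tcp redirection, so plain sh is not enough.
broker_port_open() {
  local host="${1:-localhost}" port="${2:-9092}"
  timeout 2 bash -c ": > /dev/tcp/${host}/${port}" 2>/dev/null
}

if broker_port_open localhost 9092; then
  echo "broker port is open"
else
  echo "broker port is closed"
fi
```

Kubernetes also offers a built-in tcpSocket probe that performs the same kind of check without a script.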

Rolling Updates : Use the StatefulSet RollingUpdate strategy; pods are updated one at a time, from the highest ordinal to the lowest, which allows zero‑downtime upgrades as long as topics are replicated across brokers.
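
For a more cautious rollout, the RollingUpdate strategy accepts a partition value: only pods with an ordinal greater than or equal to it receive the new revision. The helper below is a sketch that sets this value; the StatefulSet name passed in is up to the caller, and the KUBECTL variable is a convenience introduced here so the command can be previewed without a cluster.

```shell
KUBECTL="${KUBECTL:-kubectl}"   # set KUBECTL=echo to preview the command

# Restrict rolling updates to pods whose ordinal is >= the given partition.
# $1: StatefulSet name, $2: partition (ordinal)
set_rollout_partition() {
  local sts="$1" partition="$2"
  "$KUBECTL" patch statefulset "$sts" --type merge \
    -p "{\"spec\":{\"updateStrategy\":{\"type\":\"RollingUpdate\",\"rollingUpdate\":{\"partition\":${partition}}}}}"
}
```

With three brokers in a StatefulSet named broker, calling set_rollout_partition broker 2 before changing the pod template means only broker-2 picks up the change. After checking that broker, set_rollout_partition broker 0 lets the remaining pods follow, and kubectl rollout status statefulset/broker reports progress.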

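Scaling : Adjust the replica count in the StatefulSet to add or remove brokers. New brokers do not receive existing partitions automatically, so after scaling up, manually trigger partition reassignment (e.g., using the kafka-reassign-partitions.sh tool). Before scaling down, move partitions off the brokers that will be removed.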
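
The reassignment tool works from a JSON file that names the topics to move. The sketch below prepares that file and wraps the plan-generation step in a function; the topic test, the bootstrap address localhost:9092, and the bin directory are assumptions carried over from the benchmark commands, and older Kafka releases expect --zookeeper instead of --bootstrap-server.

```shell
# Topics whose partitions should be spread over the new broker set.
cat > topics-to-move.json <<'EOF'
{
  "version": 1,
  "topics": [
    { "topic": "test" }
  ]
}
EOF

KAFKA_BIN="${KAFKA_BIN:-bin}"               # assumed Kafka scripts directory
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"    # assumed bootstrap address

# Print a proposed reassignment for the given broker ids, e.g. "0,1,2,3".
propose_reassignment() {
  "${KAFKA_BIN}/kafka-reassign-partitions.sh" --bootstrap-server "$BOOTSTRAP" \
    --topics-to-move-json-file topics-to-move.json \
    --broker-list "$1" --generate
}
```

After saving the proposed plan to a file, the same tool applies it with --reassignment-json-file and --execute, and --verify reports when the move has finished.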

Topic Management : Prefer the Strimzi KafkaTopic custom resource for creating, deleting, and reconfiguring topics instead of ad‑hoc shell scripts.
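
A topic declared this way might look like the sketch below. The cluster name my-cluster and the partition, replica, and retention values are hypothetical, and the apiVersion depends on the installed Strimzi release (v1beta2 is assumed here).

```shell
# Declare a topic as a Strimzi custom resource instead of a shell command.
cat > kafka-topic.yaml <<'EOF'
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: test
  labels:
    strimzi.io/cluster: my-cluster   # hypothetical Kafka cluster name
spec:
  partitions: 6
  replicas: 3
  config:
    retention.ms: 604800000          # 7 days in milliseconds
EOF
```

Applying the file with kubectl apply -f kafka-topic.yaml hands the topic to the Strimzi Topic Operator, which keeps the resource and the actual topic in sync.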

Backup & Restore : Kafka’s durability depends on the underlying Kubernetes cluster. Use MirrorMaker for cross‑cluster replication or S3‑based snapshots (see Zalando backup guide) to protect against cluster loss.

Conclusion

For small‑ to medium‑size deployments, Kubernetes provides flexibility and operational simplicity for Kafka. Extremely low‑latency or ultra‑high‑throughput workloads may still benefit from dedicated bare‑metal or VM‑based deployments.

References

Kafka NFS issues: https://engineering.skybettingandgaming.com/2018/07/10/kafka-nfs/

Kubernetes ZooKeeper tutorial: https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/

Yolean Kafka manifest: https://github.com/Yolean/kubernetes-kafka

Helm Kafka chart: https://github.com/helm/charts/tree/master/incubator/kafka

Strimzi Operator: https://strimzi.io/

LinkedIn Kafka benchmarking: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

Review of Amazon MSK: https://medium.com/@stephane.maarek/an-honest-review-of-aws-managed-apache-kafka-amazon-msk-94b1ff9459d8

Kafka‑Monitor: https://github.com/linkedin/kafka-monitor

Zalando backup guide: https://jobs.zalando.com/tech/blog/backing-up-kafka-zookeeper/
