Three Years of Production Kubernetes: Key Lessons and Practical Tips

Over three years of running Kubernetes in production across on‑premise RHEL VMs and AWS EC2, we learned hard‑won lessons about Java container compatibility, upgrade strategies, build and deployment pipelines, probe tuning, external IP scaling, and when Kubernetes truly adds value.

dbaplus Community

Strange Java Cases

Early on, Java applications struggled in containers because older JVMs did not honor Linux cgroup limits: they sized the heap from the host's memory rather than the container's, which led to out-of-memory kills and erratic garbage-collection behavior. Oracle's later JVM flags (e.g., -XX:+UnlockExperimentalVMOptions and -XX:+UseCGroupMemoryLimitForHeap) improved container compatibility, and JVMs from JDK 10 onward detect container limits by default through -XX:+UseContainerSupport. Even so, Java still lags behind Go or Python in memory footprint and startup speed. For production workloads we now require Java 11+ and set the Kubernetes memory allocation 1 GB above the JVM max heap (-Xmx) to leave headroom for memory the JVM uses outside the heap.
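
The headroom rule above is simple arithmetic, and a minimal sketch of it might look like the following. It assumes "1 GB" means 1024 MiB and that the memory request and limit are set to the same value; both are illustrative choices, and the function names are hypothetical.

```python
# Sketch of the "-Xmx plus 1 GB" sizing rule described above.
# Assumptions (not from the article): 1 GB is treated as 1024 MiB,
# and requests == limits for memory.

def container_memory_mib(xmx_mib: int, headroom_mib: int = 1024) -> int:
    """Return the container memory allocation for a given JVM max heap."""
    if xmx_mib <= 0 or headroom_mib < 0:
        raise ValueError("heap must be positive and headroom non-negative")
    return xmx_mib + headroom_mib


def memory_settings(xmx_mib: int, headroom_mib: int = 1024) -> dict:
    """Build the JVM heap flag and a Kubernetes-style resources mapping."""
    total = f"{container_memory_mib(xmx_mib, headroom_mib)}Mi"
    return {
        "jvm_flag": f"-Xmx{xmx_mib}m",
        "resources": {
            "requests": {"memory": total},
            "limits": {"memory": total},
        },
    }
```

For a 2048 MiB heap this yields a 3072Mi allocation alongside the flag -Xmx2048m.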

Kubernetes Lifecycle Management: Upgrades

Upgrading an existing on-premise or VM-based cluster is cumbersome. We found it simplest to provision a fresh cluster on the latest version and migrate workloads to it, rather than upgrading nodes in place. Tools such as Kubespray, KubeOne, kops, and kube-aws help, but each has limitations; for example, Kubespray's upgrade playbooks require stepping through every intermediate version.
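
To see why stepping through every intermediate version adds up, here is a small sketch that lists the minor releases an in-place upgrade would have to pass through. It assumes versions of the form "major.minor" (an optional "v" prefix and patch number are ignored) and that "intermediate" means one minor release at a time; the function name is hypothetical.

```python
from typing import List


def minor_upgrade_path(current: str, target: str) -> List[str]:
    """List each minor version to pass through, ending at the target."""
    def parse(version: str):
        # Keep only major and minor; drop a leading "v" and any patch part.
        parts = version.lstrip("v").split(".")
        return int(parts[0]), int(parts[1])

    cur_major, cur_minor = parse(current)
    tgt_major, tgt_minor = parse(target)
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target version is older than the current version")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]
```

Going from 1.15 to 1.18 gives three separate upgrade runs (1.16, 1.17, 1.18), which is the kind of effort that makes a fresh cluster attractive.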

Build and Deploy

We redesigned our CI/CD pipeline around Kubernetes, moving from monolithic Jenkins jobs to Helm‑based deployments, Git‑driven versioning, and Docker image tagging. Application code and its Helm chart live in separate Git repositories, enabling independent semantic versioning. Chart versions are linked to application versions (e.g., app‑1.2.0 uses charts‑1.1.0). For third‑party services like Kafka or Redis, we version only the Helm chart and Docker tag, since we do not modify the source code.
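
One way to picture the link between application and chart versions is a small lookup that turns an application release into a Helm command line. This is only a sketch: the mapping reuses the article's example (app 1.2.0 with chart 1.1.0), while the release name, chart reference, and the image.tag value key are hypothetical and depend on how a given chart is written.

```python
from typing import Dict, List

# Application version -> chart version, using the example from the text.
CHART_FOR_APP: Dict[str, str] = {"1.2.0": "1.1.0"}


def helm_upgrade_args(release: str, chart_ref: str, app_version: str) -> List[str]:
    """Build a `helm upgrade --install` command pinning chart and image versions."""
    try:
        chart_version = CHART_FOR_APP[app_version]
    except KeyError:
        raise ValueError(f"no chart version recorded for app {app_version}")
    return [
        "helm", "upgrade", "--install", release, chart_ref,
        "--version", chart_version,            # chart version from the chart repo
        "--set", f"image.tag={app_version}",   # assumes the chart exposes image.tag
    ]
```

Calling helm_upgrade_args("myapp", "myrepo/myapp", "1.2.0") produces an argument list that could be passed to subprocess.run; keeping the mapping in version control gives each deployment a reproducible pair of versions.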

Liveness and Readiness Probes (A Double‑Edged Sword)

Kubernetes probes cut both ways. A failing liveness probe makes the kubelet restart the container, and a failing readiness probe takes the pod out of service traffic until it recovers. However, for stateful services such as Kafka, an aggressive liveness probe can kill a pod while it is still performing lengthy recovery tasks, causing a restart loop. The usual mitigation is to increase initialDelaySeconds so the application has enough time to become healthy, trading quicker restarts for a longer window before a genuinely failed start is detected. Since Kubernetes 1.16 (alpha) and 1.18 (beta) there is a third probe type, startupProbe, which holds off liveness and readiness checks until the startup probe itself succeeds, preventing premature termination.
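
The time a startupProbe allows is roughly failureThreshold × periodSeconds, so sizing it for a slow recovery is a short calculation. The sketch below builds a probe definition as a plain dictionary; the field names are the standard probe fields, but the TCP check on port 9092 and the 600-second recovery budget are assumptions chosen for illustration.

```python
import math


def startup_probe_for(max_startup_seconds: int, period_seconds: int = 10,
                      port: int = 9092) -> dict:
    """Build a startupProbe that tolerates up to max_startup_seconds of startup."""
    if max_startup_seconds <= 0 or period_seconds <= 0:
        raise ValueError("durations must be positive")
    # Round up so the allowed window is never shorter than requested.
    failure_threshold = math.ceil(max_startup_seconds / period_seconds)
    return {
        "tcpSocket": {"port": port},  # hypothetical check; use what fits the service
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


def startup_budget_seconds(probe: dict) -> int:
    """Approximate time before the container is restarted for failing to start."""
    # Fall back to the Kubernetes defaults when a field is omitted.
    return probe.get("failureThreshold", 3) * probe.get("periodSeconds", 10)
```

With a 600-second budget and a 10-second period, the probe gets failureThreshold 60; the liveness probe can then stay aggressive because it only takes over once startup has succeeded.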

Exposing External IP

Publishing services via static external IPs makes the kernel's conntrack subsystem track a very large number of flows, which can quickly approach its limit (nf_conntrack_max). In our Calico-based cluster we observed nf_conntrack_count = 167012 against a max of 262144, so the table was already about 64% full. When the table fills, new connections are dropped, which puts a hard ceiling on scalability. Mitigations include peering edge routers with many nodes to spread inbound connections and enlarging the conntrack table where possible.
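
A simple way to watch for this on a node is to compare the two kernel counters directly. The following sketch reads them from /proc/sys/net/netfilter, which is where Linux exposes them when the conntrack module is loaded; the 80% alert threshold and the function names are assumptions.

```python
from pathlib import Path
from typing import Tuple

PROC_DIR = Path("/proc/sys/net/netfilter")


def read_conntrack(proc_dir: Path = PROC_DIR) -> Tuple[int, int]:
    """Return (nf_conntrack_count, nf_conntrack_max) read from the kernel."""
    count = int((proc_dir / "nf_conntrack_count").read_text())
    maximum = int((proc_dir / "nf_conntrack_max").read_text())
    return count, maximum


def conntrack_utilization(count: int, maximum: int) -> float:
    """Fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def is_near_limit(count: int, maximum: int, threshold: float = 0.8) -> bool:
    """True when utilization reaches the (assumed) alert threshold."""
    return conntrack_utilization(count, maximum) >= threshold
```

Feeding in the figures from the text, conntrack_utilization(167012, 262144) comes out at roughly 0.64; on a live node, conntrack_utilization(*read_conntrack()) gives the current value.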

Do You Really Need Kubernetes?

After three years we recognize that Kubernetes introduces significant operational overhead: design changes, new skills to learn, and a larger team. Managed Kubernetes services can offset much of this cost, but organizations must first ask whether the platform solves a concrete problem. If the answer is no, the complexity may outweigh the benefits.

Conclusion

Kubernetes can dramatically boost productivity when its strengths align with your use case, but adopting it solely for technology’s sake is futile. Careful evaluation, incremental migration, and thoughtful configuration of probes, upgrades, and networking are essential for a successful production deployment.
