What 3 Years of Running Kubernetes in Production Taught Us

After three years of operating multiple Kubernetes clusters across bare‑metal and cloud environments, we share hard‑won lessons on Java container compatibility, upgrade strategies, CI/CD redesign, probe tuning, conntrack limits, and evaluating whether Kubernetes truly fits your workload.

Full-Stack DevOps & Kubernetes

Oct 10, 2020

1 Java Application "Pitfalls"

Engineers often avoid running Java in containers because older JVMs handled container memory limits poorly, but JVM improvements (e.g., the -XX:+UnlockExperimentalVMOptions and -XX:+UseCGroupMemoryLimitForHeap flags) have mitigated many of these issues. Our early Java 8 workloads crashed because the JVM could not detect the resource limits imposed by Linux cgroups and namespaces, so it sized itself from the host's memory rather than the container's. We now run Java 11+ and allocate an extra 1 GB of Kubernetes memory beyond the JVM -Xmx heap size to provide headroom.
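
The headroom rule is simple arithmetic. A minimal sketch, assuming the extra 1 GB is applied to the container's memory limit and is expressed as 1024 MiB; the function names are hypothetical and not part of any Kubernetes or JVM tooling:

```python
def container_memory_mib(xmx_mib: int, headroom_mib: int = 1024) -> int:
    """Return the container memory (MiB) for a JVM started with -Xmx<xmx_mib>m.

    The 1024 MiB default mirrors the "extra 1 GB" rule; treating it as
    MiB rather than decimal MB is an assumption of this sketch.
    """
    if xmx_mib <= 0:
        raise ValueError("xmx_mib must be positive")
    return xmx_mib + headroom_mib


def as_k8s_quantity(mib: int) -> str:
    """Format a MiB value as a Kubernetes resource quantity, e.g. '3072Mi'."""
    return f"{mib}Mi"


# A JVM with -Xmx2048m would get a 3072Mi container under this rule.
print(as_k8s_quantity(container_memory_mib(2048)))  # 3072Mi
```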

2 Kubernetes Lifecycle Management: Upgrades

In-place upgrades are cumbersome; the simplest approach is to provision a fresh cluster on the latest version and migrate workloads to it. Tools like Kubespray, KubeOne, kops, and kube-aws help, but they often require stepping through every minor version. We built our own clusters on RHEL VMs with Kubespray, which offers playbooks for node addition, node removal, and upgrades, though its upgrade playbooks do not allow skipping minor versions.
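
To see how many hops a sequential upgrade implies, the path can be enumerated. A minimal sketch, assuming versions are written as MAJOR.MINOR (any patch component is ignored) and that only minor versions within one major are stepped through; `minor_upgrade_path` is a hypothetical helper, not a Kubespray feature:

```python
from typing import List


def minor_upgrade_path(current: str, target: str) -> List[str]:
    """List every minor version to pass through, one hop at a time."""
    cur_major, cur_minor = (int(p) for p in current.split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("major version changes are out of scope for this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("downgrades are not supported")
    # Each intermediate minor version is its own upgrade run.
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


print(minor_upgrade_path("1.15", "1.18"))  # ['1.16', '1.17', '1.18']
```

Three hops for a three-version gap is why provisioning a fresh cluster can be less work than upgrading in place.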

3 Build and Deployment Redesign

We re-architected our CI/CD pipeline, moving from monolithic Jenkins jobs to a Git-centric workflow built on Helm charts. Application code and its Helm chart live in separate Git repositories, so each can be versioned independently. Release versions are linked (e.g., app-1.2.0 with charts-1.1.0); a change that touches only Helm values bumps only the chart's patch number. Non-code system services (Kafka, Redis) use the Docker tag as their sole version indicator, and the chart's major version is bumped whenever that Docker tag changes.
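
The two chart-versioning rules can be written down as a small function. A minimal sketch, assuming charts use MAJOR.MINOR.PATCH versions and that a major bump resets minor and patch to zero (the usual semantic-versioning convention, not something stated above); the function name and the `change` labels are hypothetical:

```python
def bump_chart_version(chart_version: str, change: str) -> str:
    """Return the next chart version for a given kind of change.

    change == "values":     only Helm values changed, bump the patch number.
    change == "docker_tag": a system service's Docker tag changed, bump major.
    """
    major, minor, patch = (int(p) for p in chart_version.split("."))
    if change == "values":
        return f"{major}.{minor}.{patch + 1}"
    if change == "docker_tag":
        # Resetting minor and patch is an assumption of this sketch.
        return f"{major + 1}.0.0"
    raise ValueError(f"unknown change type: {change!r}")


print(bump_chart_version("1.1.0", "values"))      # 1.1.1
print(bump_chart_version("1.1.1", "docker_tag"))  # 2.0.0
```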

4 Liveness and Readiness Probes (Double‑Edged Sword)

Probes automatically restart failing containers, but for stateful services like Kafka they can interfere with long start-up procedures. Our 3-broker Kafka cluster, with a replication factor of 3 and min.insync.replicas set to 2, sometimes needed 10 to 30 minutes to rebuild indexes after a crash; aggressive liveness probes would repeatedly kill the pod before it finished. The workaround is to increase initialDelaySeconds so the application has enough time to start. The trade-off is that a longer delay protects slow recoveries but also postpones detection of a pod that has genuinely failed.

Update: Newer Kubernetes releases introduce a "startup probe" (alpha in 1.16, beta in 1.18) that holds off liveness and readiness checks until the startup probe itself succeeds, preventing premature restarts.
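
A startup probe gives the container roughly failureThreshold × periodSeconds to come up before it is restarted, so the threshold can be derived from the longest start-up you expect. A minimal sketch, assuming a 10-second probe period; the helper name is hypothetical and the returned dict only mirrors the two probe fields involved:

```python
import math


def startup_probe_budget(max_startup_seconds: int, period_seconds: int = 10) -> dict:
    """Return periodSeconds and failureThreshold covering the given start-up time."""
    if max_startup_seconds <= 0 or period_seconds <= 0:
        raise ValueError("durations must be positive")
    # Round up so the budget is never shorter than the expected start-up.
    failure_threshold = math.ceil(max_startup_seconds / period_seconds)
    return {"periodSeconds": period_seconds, "failureThreshold": failure_threshold}


# Budget for the 30-minute worst case described for the Kafka brokers.
print(startup_probe_budget(30 * 60))  # {'periodSeconds': 10, 'failureThreshold': 180}
```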

5 Exposing Services via Static External IPs

Using static external IPs incurs heavy conntrack overhead. Our clusters run Calico CNI with BGP routing and iptables‑mode kube‑proxy. Each external connection is tracked via the kernel conntrack table; once the table reaches its limit (e.g., net.netfilter.nf_conntrack_max = 262144), new connections are dropped. Scaling the conntrack table or distributing inbound traffic across edge routers can mitigate this bottleneck.

$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 262144
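
Comparing the live entry count against the limit shows how close a node is to dropping connections. A minimal sketch, assuming the nf_conntrack module is loaded so that both values are exposed under /proc/sys/net/netfilter; the 80% alert threshold and the function names are illustrative assumptions:

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(count: int, maximum: int) -> float:
    """Return the fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def is_near_limit(count: int, maximum: int, threshold: float = 0.8) -> bool:
    """True when usage has reached the (assumed) alert threshold."""
    return conntrack_usage(count, maximum) >= threshold


def read_conntrack_usage(proc_dir: Path = PROC_DIR) -> float:
    """Read the current count and limit from procfs on a Linux node."""
    count = int((proc_dir / "nf_conntrack_count").read_text())
    maximum = int((proc_dir / "nf_conntrack_max").read_text())
    return conntrack_usage(count, maximum)


# With the limit shown above, 230000 tracked connections is about 88% full.
print(is_near_limit(230000, 262144))  # True
```

On a node, `read_conntrack_usage()` returns the live ratio, which can be exported to whatever monitoring system is already in place.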

6 Do You Really Need Kubernetes?

Kubernetes brings architectural shifts, operational overhead, and a steep learning curve. Managed services can reduce maintenance burden, but you must assess whether the platform’s benefits outweigh its costs for your specific use case. Adopt Kubernetes only when its features are essential to your workload.
