How to Stabilize Java Services on Kubernetes: A 3‑Year Success Story
This article walks through a real‑world Java service on Kubernetes, detailing the initial confidence, recurring OOM and rollout issues, and a multi‑round remediation that introduced container‑aware JVM settings, refined resource requests, OOM dumps, probes, and metrics, ultimately achieving three years of stable operation with lower resource usage.
From "Can Run" to Three Years of Stable Operation
Running Java on Kubernetes often starts with confidence: "Just build an image, write a YAML, start a pod." In production, however, teams face a long‑term battle: intermittent OOMKilled restarts, memory spikes while CPU stays low, 502 errors during rolling upgrades, and pods that appear alive but are effectively dead.
Real Production Case: A Typical Business‑Middle‑Platform Service
Background:
Spring Boot + JDK 17
Moderate daily request volume with clear peaks
Response‑time (RT) sensitive, especially in the middle of request chains
Batch jobs and promotions amplify load
Standard deployment configuration:
```yaml
resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    cpu: "2"
    memory: "2Gi"
```

Traditional JVM parameters: `-Xms2g -Xmx2g`

Initially everything worked, but problems emerged over time.
Problem Manifestation
1️⃣ Intermittent OOMKilled
No Java OOM logs appear; only the pod reason shows OOMKilled. The pod restarts and runs again.
2️⃣ RT Jitter at Peak While CPU Stays Low
CPU usage: 40%–50%
Memory near the limit
RT occasionally spikes
Full GC appears sporadically
Teams often mistakenly double the pod replica count, which only raises resource cost without fixing the root cause.
3️⃣ Rolling Upgrade 502 Errors
Half of the pods become Ready, the other half keep restarting
Users see 502 responses
This issue is hard to reproduce in a test environment.
Root Cause Diagnosis
The root cause lies with neither Kubernetes nor Java itself: the JVM is simply not being treated as a container‑aware runtime.
In other words, teams continue to apply "bare‑metal JVM thinking" inside a container with strict resource boundaries.
Systematic Remediation
Round 1 – Refactor Resources and JVM Parameters
Remove fixed -Xms2g and -Xmx2g settings.
```
# Delete
-Xms2g
-Xmx2g
```

Adopt container‑aware flags:

```
-XX:+UseContainerSupport
-XX:MaxRAMPercentage=70
```

On JDK 10 and later, `-XX:+UseContainerSupport` is already enabled by default, so the decisive change here is `MaxRAMPercentage`. Design logic: with a memory limit of 2Gi, this yields Heap ≈ 1.4Gi and keeps ~600Mi for non‑heap memory (Metaspace, thread stacks, direct buffers, code cache) and the OS.
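One way to wire these flags in, as a minimal sketch (the container and image names are placeholders; `JAVA_TOOL_OPTIONS` is a standard variable the HotSpot JVM picks up automatically):

```yaml
# Sketch: container and image names are placeholders.
containers:
  - name: app
    image: registry.example.com/app:1.0.0
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=70"
```

The advantage over baking flags into the image's entrypoint is that heap policy can be tuned per environment without rebuilding the image.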
Round 2 – Precise Resource Requests and Limits
Define requests and limits from observed monitoring data:
Steady memory ≈ 1.6 Gi
Peak memory ≈ 1.9 Gi
Non‑heap fluctuation ≈ 300–400 Mi

```yaml
requests:
  cpu: "800m"
  memory: "2Gi"
limits:
  cpu: "2"
  memory: "2.5Gi"
```

The 2.5Gi limit leaves headroom above the observed 1.9Gi peak, and setting requests below limits places the pod in the Burstable QoS class.
Round 3 – Make OOM Visible
Enable heap dump on OOM:
```
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/heapdumps
```

Principle: an OOM in production must leave forensic evidence.

Force the JVM to exit immediately on OOM:

```
-XX:+ExitOnOutOfMemoryError
```

This allows Kubernetes to restart the pod automatically.
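A dump under /heapdumps is only useful if it survives the container; a minimal sketch assuming an emptyDir volume (a PersistentVolumeClaim would be needed for dumps to outlive the pod itself):

```yaml
# Sketch: the volume name and emptyDir choice are assumptions; dumps in
# an emptyDir survive container restarts but not pod deletion.
containers:
  - name: app
    volumeMounts:
      - name: heapdumps
        mountPath: /heapdumps   # matches -XX:HeapDumpPath=/heapdumps
volumes:
  - name: heapdumps
    emptyDir: {}
```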
Round 4 – Refine Probes and Graceful Shutdown
Readiness checks only traffic eligibility.
Liveness checks only restart necessity.
Add a Startup Probe to avoid killing slow‑starting Java processes.
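A minimal sketch of that probe split, assuming Spring Boot's actuator health groups on port 8080 (paths, port, and timings are illustrative, not values from the case):

```yaml
# Sketch only: endpoints, port, and timings are assumptions.
startupProbe:              # protects slow JVM startup from the liveness probe
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 30     # up to 30 * 5s = 150s to start
  periodSeconds: 5
readinessProbe:            # traffic eligibility only
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
livenessProbe:             # restart necessity only
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```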
Implement true graceful termination (see the configuration sketch after these steps):
Capture SIGTERM
Stop accepting new requests
Wait for in‑flight requests to finish
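For a Spring Boot service, these three steps are largely built in (since Spring Boot 2.3); a minimal application.yml sketch, with a 30s drain timeout as an assumed value:

```yaml
# Sketch: on SIGTERM, Spring Boot stops accepting new requests and
# waits up to the timeout below for in-flight requests to finish.
server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s
```

Keep the pod's terminationGracePeriodSeconds at or above this timeout so Kubernetes does not SIGKILL the JVM mid‑drain.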
Round 5 – Metric‑Driven JVM Observability
Expose metrics via Micrometer:
Heap / Non‑Heap usage
GC count and duration
Thread count changes
Benefits: OOM becomes predictable, memory leaks show trends, and GC spikes can be detected early.
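With Spring Boot Actuator and the Micrometer Prometheus registry on the classpath (an assumption about the build, e.g. micrometer-registry-prometheus), exposing these series is mostly configuration; a minimal application.yml sketch:

```yaml
# Sketch: assumes the micrometer-registry-prometheus dependency.
management:
  endpoints:
    web:
      exposure:
        include: "health,prometheus"   # /actuator/prometheus serves the metrics
  metrics:
    tags:
      application: my-service         # placeholder tag for dashboards and alerts
```

Spring Boot auto‑binds the JVM meters behind the bullets above, e.g. jvm.memory.used, jvm.gc.pause, and jvm.threads.live.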
Round 6 – Re‑thinking Autoscaling
CPU‑only HPA is insufficient for Java workloads; effective scaling combines the following signals (see the sketch after this list):
CPU + RT
Memory trend
Request queue length
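Natively, HPA v2 covers the CPU and memory signals; RT and queue length require a custom‑metrics pipeline (e.g. Prometheus Adapter or KEDA), which is assumed and only indicated here. A sketch with placeholder names and thresholds:

```yaml
# Sketch: names and thresholds are placeholders, not values from the case.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
    # RT and request-queue-length signals would be added here as
    # type: Pods or type: External metrics via a custom-metrics adapter.
```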
Final Outcome
OOM transformed from occasional accidents to predictable events.
Rolling‑upgrade 502 errors disappeared.
Resource usage dropped by over 20%.
No more midnight OOMKilled wake‑ups.
Key takeaways:
Use a container‑aware JVM.
Do not equate memory limit with JVM‑available memory.
Treat OOM as a system‑design issue, not a random fault.
Kubernetes handles restarts; the JVM should exit on fatal errors.
Stability stems from continuous, systematic governance rather than one‑off tuning.