
How to Stabilize Java Services on Kubernetes: A 3‑Year Success Story

This article walks through a real‑world Java service on Kubernetes, detailing the initial confidence, recurring OOM and rollout issues, and a multi‑round remediation that introduced container‑aware JVM settings, refined resource requests, OOM dumps, probes, and metrics, ultimately achieving three years of stable operation with lower resource usage.


From "Can Run" to Three Years of Stable Operation

Running Java on Kubernetes often starts with confidence: "Just build an image, write a YAML, start a pod." In production, however, teams face a long-term battle: services are OOMKilled, memory spikes while CPU stays low, rolling upgrades return 502 errors, and pods appear alive but are effectively dead.

Real Production Case: A Typical Business‑Middle‑Platform Service

Background:

Spring Boot + JDK 17

Moderate daily request volume with clear peaks

Response‑time (RT) sensitive, especially since it sits in the middle of request chains

Batch jobs and promotions amplify load

Standard deployment configuration:

resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    cpu: "2"
    memory: "2Gi"

Traditional JVM parameters: -Xms2g -Xmx2g

Initially everything worked, but problems emerged over time. The seed of trouble is already visible: with the heap pinned at 2 g and the container limit also at 2 Gi, Metaspace, thread stacks, and direct buffers have no headroom at all.

Problem Manifestation

1️⃣ Intermittent OOMKilled

No java.lang.OutOfMemoryError appears in the logs; only the pod's termination reason shows OOMKilled. The kernel OOM killer terminates the process when the container's total memory exceeds its limit, so the JVM never gets a chance to throw. The pod restarts and runs again.

2️⃣ RT Jitter at Peak While CPU Stays Low

CPU usage: 40%–50%

Memory near the limit

RT occasionally spikes

Full GC appears sporadically

Teams often mistakenly double the pod replica count, which only raises resource cost without fixing the root cause.

3️⃣ Rolling Upgrade 502 Errors

Half of the pods become Ready, the other half keep restarting

Users see 502 responses

This issue is hard to reproduce in a test environment.

Root Cause Diagnosis

The problem lies with neither Kubernetes nor Java: the JVM is simply not being treated as a container‑aware runtime.

In other words, teams continue to apply "bare‑metal JVM thinking" inside a container with strict resource boundaries.

Systematic Remediation

Round 1 – Refactor Resources and JVM Parameters

Remove fixed -Xms2g and -Xmx2g settings.

# Delete
-Xms2g
-Xmx2g

Adopt container‑aware flags (UseContainerSupport has been on by default since JDK 10, so on JDK 17 it is redundant, but stating it makes the intent explicit):

-XX:+UseContainerSupport
-XX:MaxRAMPercentage=70

Design logic: with a 2 Gi memory limit, MaxRAMPercentage=70 caps the heap at roughly 70% of 2 Gi ≈ 1.4 Gi, keeping ~600 Mi for Metaspace, thread stacks, direct buffers, and the OS.
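One common way to wire these flags into the pod (a minimal sketch, not necessarily the original manifest) is the standard JAVA_TOOL_OPTIONS environment variable, which the JVM picks up automatically at startup:

env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=70"

Keeping the flags in the Deployment rather than baked into the image makes per‑environment tuning a manifest change instead of an image rebuild.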

Round 2 – Precise Resource Requests and Limits

Define requests based on monitoring data:

requests:
  cpu: "800m"
  memory: "2Gi"
limits:
  cpu: "2"
  memory: "2.5Gi"

Steady memory ≈ 1.6 Gi

Peak memory ≈ 1.9 Gi

Non‑heap fluctuation ≈ 300–400 Mi

With the new 2.5 Gi limit, MaxRAMPercentage=70 caps the heap at roughly 1.75 Gi, leaving about 750 Mi of headroom to absorb that non‑heap fluctuation plus the OS.

Round 3 – Make OOM Visible

Enable heap dump on OOM:

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/heapdumps

Principle: an OOM in production must leave forensic evidence.

Force the JVM to exit immediately on OOM:

-XX:+ExitOnOutOfMemoryError

This allows Kubernetes to restart the pod automatically instead of leaving a half‑dead process serving errors.
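For the dumps to be worth anything, /heapdumps must be writable and should outlive the restart. A minimal sketch (assuming an emptyDir, which survives container restarts within the same pod; a PersistentVolumeClaim would also survive pod rescheduling):

volumes:
  - name: heapdumps
    emptyDir: {}
containers:
  - name: app                  # container name assumed
    volumeMounts:
      - name: heapdumps
        mountPath: /heapdumps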

Round 4 – Refine Probes and Graceful Shutdown

Readiness checks only traffic eligibility.

Liveness checks only restart necessity.

Add a Startup Probe to avoid killing slow‑starting Java processes.
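A probe layout along these lines (a sketch assuming Spring Boot 2.3+ Actuator probe endpoints on port 8080; paths and thresholds should be adjusted to the service):

startupProbe:
  httpGet:
    path: /actuator/health/liveness   # assumes Actuator health probes are enabled
    port: 8080
  periodSeconds: 5
  failureThreshold: 30                # tolerates up to ~150 s of JVM warm-up
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

The startup probe suppresses liveness checks until the application reports healthy once, which is what keeps slow‑starting JVMs from being killed mid‑boot.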

Implement true graceful termination (a configuration sketch follows the list):

Capture SIGTERM

Stop accepting new requests

Wait for in‑flight requests to finish
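In Spring Boot this is mostly configuration rather than custom signal handling. A hedged sketch (these properties exist as of Spring Boot 2.3; the 30 s value is an assumption chosen to illustrate the ordering):

# application.yml
server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s

On the Kubernetes side, terminationGracePeriodSeconds in the pod spec must exceed the shutdown timeout (for example 45), or the kubelet will SIGKILL the JVM while requests are still draining.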

Round 5 – Metric‑Driven JVM Observability

Expose metrics via Micrometer:

Heap / Non‑Heap usage

GC count and duration

Thread count changes

Benefits: OOM becomes predictable, memory leaks show trends, and GC spikes can be detected early.
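With Spring Boot plus the micrometer-registry-prometheus dependency on the classpath, exposing these metrics is largely configuration (a minimal sketch; endpoint names follow stock Actuator conventions):

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus

The JVM binders (jvm.memory.used, jvm.gc.pause, jvm.threads.live) are registered automatically by Spring Boot, so the scrape endpoint at /actuator/prometheus carries heap, GC, and thread series with no extra code.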

Round 6 – Re‑thinking Autoscaling

CPU‑only HPA is insufficient for Java workloads, because heap pressure and RT degrade long before CPU saturates. Effective scaling combines the signals below (a hedged HPA sketch follows the list):

CPU + RT

Memory trend

Request queue length
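A sketch of what that can look like with the autoscaling/v2 API (the p99‑latency metric name is hypothetical and requires a metrics adapter such as prometheus-adapter to be installed in the cluster):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: middle-platform-svc            # name assumed
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: middle-platform-svc
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods                       # custom per-pod metric served by an adapter
      pods:
        metric:
          name: http_server_requests_p99_seconds   # hypothetical metric name
        target:
          type: AverageValue
          averageValue: "300m"         # scale out when p99 exceeds ~0.3 s

When multiple metrics are listed, the HPA computes a desired replica count per metric and takes the maximum, so either CPU or latency alone can trigger a scale‑out.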

Final Outcome

OOM transformed from occasional accidents to predictable events.

Rolling‑upgrade 502 errors disappeared.

Overall resource usage dropped by more than 20%.

No more midnight OOMKilled wake‑ups.

Key takeaways:

Use a container‑aware JVM.

Do not equate memory limit with JVM‑available memory.

Treat OOM as a system‑design issue, not a random fault.

Kubernetes handles restarts; the JVM should exit on fatal errors.

Stability stems from continuous, systematic governance rather than one‑off tuning.

Tags: Java, JVM, cloud-native, operations, Kubernetes, OOM
Written by Full-Stack DevOps & Kubernetes

Focused on sharing DevOps, Kubernetes, Linux, Docker, Istio, microservices, Spring Cloud, Python, Go, databases, Nginx, Tomcat, cloud computing, and related technologies.
