Why Did My Tomcat Crash at 750 TPS? Uncovering the OOM Killer Mystery
A detailed case study reveals how an out‑of‑memory (OOM) situation on a Linux server caused Tomcat to crash during load testing, explains the kernel's OOM killer behavior, and provides practical solutions such as enabling swap and tuning Apache limits.
Background
The project architecture consists mainly of a front‑end interactive experience and object storage, with Redis and RDS omitted for brevity. web1 and web2 are two Apache servers, while publisher1 and publisher2 run Tomcat application servers.
During a load test, when concurrency reached about 750, the Alibaba Cloud PTS tool reported a drop in TPS and a rise in response time (RT).
Real‑time monitoring showed that the CPU utilization of publisher1 dropped to 1.9%, indicating the node had likely crashed. Network traffic for that node disappeared, and an SSH login confirmed that the Tomcat process had stopped, yet no Tomcat or application logs recorded any error.
The Tomcat JVM was started with JAVA_OPTS="-Xmx3072m" on a machine that has 8 GB of physical RAM, which seemed sufficient at first glance.
Root Cause Analysis
Tomcat crashes caused by a Java heap overflow usually leave traces in the container or crash logs, but in this case the Linux kernel killed the process because the whole system ran out of physical memory. When memory is exhausted, the kernel's OOM killer picks a victim, usually the largest memory consumer (here Tomcat), and terminates it with SIGKILL. The JVM cannot catch that signal, so it never gets a chance to write a log.
Examining /var/log/messages around the time of the failure reveals OOM‑killer entries.
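To see which process the kernel would currently favor as a victim, the per-process badness score can be read from /proc/<pid>/oom_score. The following is a minimal sketch, not part of the original investigation; the function name rank_by_oom_score and its parameters are made up for illustration.

```python
from pathlib import Path


def rank_by_oom_score(proc_root="/proc", top=5):
    """Return up to `top` (oom_score, pid, name) tuples, highest score first."""
    ranked = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            score = int((entry / "oom_score").read_text().strip())
            name = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # process exited or is unreadable; ignore it
        ranked.append((score, int(entry.name), name))
    # Highest score first; ties are broken by the lower PID.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked[:top]


if __name__ == "__main__":
    for score, pid, name in rank_by_oom_score():
        print(f"{score:>6}  {pid:>7}  {name}")
```

On a node like publisher1, a JVM holding several gigabytes would be expected to appear at or near the top of this list.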
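A log search like the one described above can be scripted. The sketch below assumes the kernel messages contain the phrases "invoked oom-killer" and "Kill process" or "Killed process"; the exact wording varies between kernel versions, so the patterns may need adjusting. The function name find_oom_events is hypothetical.

```python
import re

# Assumed wordings; kernel versions differ in the exact message text.
INVOKED = re.compile(r"(?P<name>\S+) invoked oom-killer")
KILLED = re.compile(r"Kill(?:ed)? process (?P<pid>\d+) \((?P<name>[^)]+)\)")


def find_oom_events(lines):
    """Return (line_number, kind, process_name, pid_or_None) for OOM lines."""
    events = []
    for number, line in enumerate(lines, start=1):
        killed = KILLED.search(line)
        if killed:
            events.append(
                (number, "killed", killed.group("name"), int(killed.group("pid")))
            )
            continue
        invoked = INVOKED.search(line)
        if invoked:
            events.append((number, "invoked", invoked.group("name"), None))
    return events


if __name__ == "__main__":
    # Reading this file usually requires root privileges.
    with open("/var/log/messages", errors="replace") as log:
        for event in find_oom_events(log):
            print(event)
```

Some kernels log both a "Kill process" and a "Killed process" line for the same victim, so one kill may show up twice in the output.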
The timestamps (UTC) correspond to the peak memory usage shown in the monitoring graphs; around 11:15 UTC the 8 GB of physical memory on publisher1 was fully consumed.
Solution
When physical memory runs short, Linux normally moves inactive pages out to swap space. In this environment no swap was configured (swap size was zero), so the kernel had nowhere to offload memory and the OOM killer terminated Tomcat. Enabling swap gives the kernel headroom to absorb the peak, and adding more RAM addresses the shortage directly; either avoids the crash seen in this test.
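Whether a node has any swap configured can be checked from /proc/meminfo. The following sketch parses the SwapTotal and SwapFree fields; the helper names parse_meminfo and swap_status are assumptions made for this example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_status(meminfo):
    """Summarize swap configuration from a parsed meminfo dict."""
    total = meminfo.get("SwapTotal", 0)
    free = meminfo.get("SwapFree", 0)
    return {
        "swap_configured": total > 0,
        "swap_total_kb": total,
        "swap_used_kb": total - free,
    }


if __name__ == "__main__":
    with open("/proc/meminfo") as fh:
        print(swap_status(parse_meminfo(fh.read())))
```

A result with swap_configured set to False matches the situation described here: once RAM is full, the kernel's only remaining option is to kill a process.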
For the Apache servers, reducing MaxClients and ServerLimit, then restarting httpd, also mitigates OOM risk.
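One common rule of thumb for choosing these limits is to divide the memory reserved for Apache by the average resident size of one httpd process. The sketch below applies that heuristic; it is an assumption for illustration, not the sizing used in the original test, and the function name and all numbers are hypothetical. With the worker MPM, MaxClients counts threads, so the process count is multiplied by ThreadsPerChild.

```python
def size_apache_limits(memory_budget_mb, per_process_mb, threads_per_child=1):
    """Estimate (ServerLimit, MaxClients) from a memory budget.

    threads_per_child=1 models the prefork MPM; larger values model the
    worker MPM, where MaxClients is the total number of threads.
    """
    if memory_budget_mb <= 0 or per_process_mb <= 0 or threads_per_child < 1:
        raise ValueError("budget, per-process size and thread count must be positive")
    # Number of httpd processes that fit in the budget (at least one).
    server_limit = max(1, int(memory_budget_mb // per_process_mb))
    max_clients = server_limit * threads_per_child
    return server_limit, max_clients


if __name__ == "__main__":
    # Hypothetical figures: 2 GB reserved for Apache, 64 MB per process.
    print(size_apache_limits(2048, 64, threads_per_child=25))
```

The per-process figure should be measured on the real server under load, since it depends on the loaded modules and the workload.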
Additional Case: Apache Node Failure
After publisher1 was fixed, further testing at 1000 concurrent users made web2 (Apache) unreachable. The system logs showed a similar OOM event caused by exhausted physical memory; the Apache worker process had been restarted automatically.
The remedy mirrors the Tomcat case: enable swap for the Apache node and tune the connection limits.
MaGe Linux Operations
