Investigation of Java Service Crashes at Midnight Due to Cron and Open Files Limit in CentOS Containers
The article analyzes why a Java service repeatedly crashes around midnight in test environments, tracing the issue through system limits, Java version checks, cron job execution, strace logs, and Linux OOM killer behavior, and finally proposes configuration and version upgrades to prevent the failures.
Users reported that a Java service in the test environment consistently terminated around 00:00 despite having no scheduled tasks, no traffic spikes, and reasonable JVM settings. The investigation began by reproducing the problem with a minimal Spring Boot "hello world" WAR deployed on a Tomcat base image (base_tomcat/java-centos6-jdk18-60-tom8050-ngx197, Java 1.8.0_60).
Initial suspicion fell on Linux limit settings. The ulimit -n (open files) values of the failing containers were examined and found to be unusually high.
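The comparison between healthy and failing containers can be sketched with standard tools; the commands below read the shell's own limits and, via /proc, the limits of an already-running process (the use of /proc/self here is illustrative; in practice you would substitute the Java PID).

```shell
# Soft open-files limit of the current shell
ulimit -n

# Hard limit (the ceiling a non-root process may raise the soft limit to)
ulimit -Hn

# Limits of a running process, read from /proc; replace "self" with the
# target Java PID to inspect another process
grep "open files" /proc/self/limits
```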
Testing the limit hypothesis
prlimit -p 32672 --nofile=1048576

Even after adjusting the limit to match that of a healthy machine, the Java process still died at midnight, indicating that the open files limit was not the direct cause.
Java version check
The JDOS R&D team suggested that the old Java version might allow excessive memory allocation. A referenced article (Docker support in new Java 8) explained that exceeding Docker cgroup memory limits could trigger JVM termination, and that newer Java versions mitigate this by sizing the heap from the container limit.
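As a sketch of the mitigation that article describes: the flag names below are real JVM options, while the jar name and heap percentage are illustrative assumptions.

```shell
# Java 8u131 - 8u190: opt in to cgroup-aware heap sizing (experimental flags)
java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -jar app.jar

# Java 8u191+ and Java 10+: container support is on by default; size the heap
# as a fraction of the container memory limit (percentage here is illustrative)
java -XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0 -jar app.jar
```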
An experiment using Java 11.0.8 showed the same crash behavior, ruling out the Java version as the root cause.
Cron job investigation
Since the base image includes system cron tasks, the team inspected /etc/crontab and identified a logrotate.sh script scheduled at the same time as the crashes. Modifying the cron schedule to 11:00 and capturing a strace trace confirmed that the Java process terminated when the cron job ran.
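The inspection above can be sketched as follows. The crontab content is a hypothetical stand-in for the real /etc/crontab, and "tomcat" is an assumed process name for the pgrep match.

```shell
# Find cron entries that fire at midnight (minute 0, hour 0).
# A sample file stands in for /etc/crontab inside the container.
cat > /tmp/crontab.sample <<'EOF'
SHELL=/bin/bash
0 0 * * * root /usr/local/bin/logrotate.sh
30 3 * * * root /usr/sbin/some-other-task
EOF
awk '$1 == "0" && $2 == "0" {print $NF}' /tmp/crontab.sample

# After rescheduling the job, attach strace to the Java process and wait:
# strace -f -tt -o /tmp/java.strace -p "$(pgrep -f tomcat)"
```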
19:59:01 close(3) = 0
19:59:01 stat("/etc/pam.d", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
... (truncated for brevity) ...
19:59:06 +++ killed by SIGKILL +++

The trace showed a massive mmap allocation of roughly 4 GB just before the process was killed, indicating an OOM situation triggered by the cron task.
Understanding the OOM killer
The Linux OOM killer selects a process to terminate based on memory usage, OOM score, priority, and other attributes. In this case, the cron child process caused a rapid memory spike that exceeded container limits, leading the kernel to kill both the cron child and the Java process.
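The kernel's candidate selection can be observed per process through /proc (Linux-specific); the sketch below reads the current process's own score, but any PID can be substituted for "self".

```shell
# Badness score the OOM killer would use for this process; higher means more
# likely to be chosen, derived mainly from resident memory usage
cat /proc/self/oom_score

# User-space adjustment in [-1000, 1000]; -1000 exempts the process entirely
cat /proc/self/oom_score_adj
```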
Later versions of the cronie package (≥ 1.5.7-5) fix the underlying bug: before sending job-output mail, the cron daemon cleared a region of memory sized according to the open-files limit, so an extremely high limit produced the huge allocation seen in the trace. Patched versions no longer do this.
Solution
Upgrade the base image to a newer, stable CentOS version (e.g., 6.10 or 7.9) where the issue does not occur.
Set a reasonable open files limit for containers.
For application_worker type services, adjust the limit in the startup script; for web_tomcat services, consider removing or disabling the problematic cron task.
Upgrade cronie to version 1.5.7-5 or later (check the installed version with rpm -q cronie).
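The checks above can be combined into a small pre-deployment sanity script; the 1048576 threshold is an illustrative assumption, not a value prescribed by the article.

```shell
#!/bin/sh
# Warn if the open-files limit is suspiciously high (threshold is illustrative)
nofile=$(ulimit -n)
echo "open files limit: $nofile"
if [ "$nofile" -gt 1048576 ] 2>/dev/null; then
  echo "WARNING: nofile limit very high; old cronie may over-allocate on mail"
fi

# Verify the cronie version (the fixed release is 1.5.7-5 or later)
rpm -q cronie 2>/dev/null || echo "rpm not available or cronie not installed"
```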
Finally, the article emphasizes that improper open files limits combined with cron tasks can cause severe memory OOM events, and recommends verifying container OS versions and limit settings before deployment.
JD Retail Technology