
Root Cause Analysis of Java Process Termination Caused by Linux OOM Killer Triggered by Vim Opening a 37 GB Log File

Summary: A developer opened a 37 GB Nginx log file with Vim inside a container limited to 8 GB of memory. Vim loaded the entire file into memory, the Linux OOM killer terminated the co-located Java process, and a port alarm fired. This postmortem traces the investigation and recommends streaming tools over Vim for large files.

Didi Tech

A port alarm was triggered at 15:19, indicating that the Java process for the service had disappeared. The investigation timeline shows the steps taken to isolate and resolve the issue.

Timeline

15:19 Received port anomaly alarm (P1 fault) for application port 8989.

Status: P1 fault
Name: application port 8989
Metric: data-stream-openapi.port.8989
Host: data-stream-openapi-nmg-sf-a9457-1.docker.nmg01
Node: hbb-v.data-stream-openapi.data-stream.datadream.didi.com
Current value: 0.00
Description: happen(data-stream-openapi.port.8989,#12,12) = 0
Fault time: 2022-11-15 15:21:10

After confirming the Java process was missing, the container traffic was cut off (15:23), the container was rebuilt (15:24‑15:26), and the service returned to normal.

Investigation Path

The team first ruled out deployment issues and checked container performance metrics. Memory usage spiked at 15:19 and then dropped quickly, prompting a check for Full GC, which showed none. Young-generation GC (YGC) activity was normal apart from a brief rise at 15:25 caused by the container rebuild, and CPU usage likewise showed only a short increase at 15:25.

Application‑level monitoring showed that the success rate of the API endpoints did not drop during the alarm period, indicating the problem was not in the application itself.

Root Cause Identification

Running dmesg -T | grep java revealed that the Java process (PID 1155805) was killed at 15:17 by the Linux OOM killer due to out‑of‑memory conditions. The port alarm at 15:19 was a downstream effect of the killed process.
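The kernel records each OOM kill in its ring buffer, which is why dmesg surfaced the event. As a minimal sketch (the exact message wording varies between kernel versions, and the function name here is ours), a parser for the common "Killed process <pid> (<name>)" form could look like:

```python
import re

# Matches the common kernel OOM log form "Killed process <pid> (<name>)".
# Kernel message wording varies by version; this is illustrative, not robust.
KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")

def find_oom_kills(dmesg_text):
    """Return (pid, process_name) pairs for every OOM kill line found."""
    return [(int(m.group(1)), m.group(2)) for m in KILL_RE.finditer(dmesg_text)]
```

Fed the output of dmesg -T, a scan like this would have surfaced PID 1155805 (java) in this incident.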

A colleague reported that while inspecting the Nginx log with vim, the editor itself reported an error. The log file was 37 GB, far exceeding the container's 8 GB memory limit. Opening such a large file caused Vim to read the entire file into memory via its readfile function, exhausting the available RAM.
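The hazard can be sketched in a few lines of Python (the function names are illustrative): reading a file in one call allocates memory proportional to its size, much like Vim's behaviour here, while line-by-line iteration keeps memory use roughly constant.

```python
def risky_read(path):
    # Loads the ENTIRE file into one string. For a 37 GB log this
    # asks for ~37 GB of RAM -- comparable to opening it in Vim.
    with open(path) as f:
        return f.read()

def stream_count_matches(path, needle):
    # Streams the file line by line; peak memory stays around the
    # size of a single line, regardless of total file size.
    count = 0
    with open(path) as f:
        for line in f:
            if needle in line:
                count += 1
    return count
```

This is why tools like grep and tail stay cheap on arbitrarily large logs: they never hold more than a small window of the file in memory.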

Principle Analysis

Two questions were examined:

Does Vim load the whole file into memory? – Yes. Vim’s readfile function ultimately issues read() system calls and pulls the entire file into memory, which can consume massive amounts of RAM for very large files.

How does the Linux OOM killer decide which process to kill? – When memory is critically low, the kernel scores processes based on factors such as memory consumption, oom_score_adj, and other heuristics. The process with the highest score is terminated to protect the system. In this case, the Java process received a higher score than Vim, so it was killed.
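On Linux the score is observable per process through /proc. A minimal sketch (the helper name is ours; the /proc file names are real) that reads a process's current badness score and its user-set adjustment:

```python
def oom_scores(pid="self"):
    # /proc/<pid>/oom_score is the kernel's current badness score for the
    # process; /proc/<pid>/oom_score_adj is a tunable in [-1000, 1000]
    # that operators can lower to shield critical processes.
    with open(f"/proc/{pid}/oom_score") as f:
        score = int(f.read())
    with open(f"/proc/{pid}/oom_score_adj") as f:
        adj = int(f.read())
    return score, adj
```

Setting a negative oom_score_adj on the Java process (and leaving ad-hoc tools like Vim at the default) is one way to bias the killer away from the service, though it does not remove the underlying memory pressure.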

/* Excerpt from the comment atop the kernel's OOM killer source (mm/oom_kill.c):
 * If we run out of memory, we have the choice between either
 * killing a random task (bad), letting the system crash (worse)
 * OR try to be smart about which process to kill. Note that we
 * don't have to be perfect here, we just have to be good.
 */
Possible reasons your process gets killed by Linux:
- a memory leak;
- your process genuinely needs more memory than the system can provide; its design should cap resource usage rather than let it grow without bound;
- it may not be your process's fault at all: another process on the same host may be consuming too many resources, but because the OOM killer's "worst process" selection heuristic is crude, your process may still be the one chosen.

Operational Recommendations

When inspecting large logs, prefer tools that stream data (e.g., less, grep, tail) instead of vim. If Vim must be used, always check the file size beforehand (e.g., with ls -lh or du -h) to avoid OOM situations.
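The size check itself can be automated. A minimal sketch, assuming a hypothetical 1 GB comfort threshold (tune it to the container's actual memory limit):

```python
import os

LIMIT_BYTES = 1 << 30  # hypothetical 1 GB threshold; tune per container

def safe_to_open_in_editor(path, limit=LIMIT_BYTES):
    # True only if the file would plausibly fit in memory alongside
    # the other processes in the container.
    return os.path.getsize(path) <= limit
```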

Tags: Java, performance monitoring, Linux, Container, Vim, Incident Postmortem, OOM Killer
Written by Didi Tech, the official Didi technology account.