Why Do Linux Processes Vanish? A Complete Troubleshooting Guide
This article systematically explains why Linux processes may disappear, covering OOM Killer, signal termination, cgroup limits, systemd timeouts, manual kills, and provides step‑by‑step diagnostic commands and preventive measures for RHEL, AlmaLinux, and Ubuntu environments.
1. Common Reasons for Process Disappearance
Processes may disappear because they exit normally or because they are terminated by a signal. A normal exit usually leaves a trail in the application's own logs, whereas unexplained disappearances are more often caused by signals (e.g., SIGKILL, SIGTERM, SIGSEGV, SIGABRT, SIGBUS). Special cases include the OOM Killer, cgroup limits, systemd TimeoutStopSec, PID/thread limits, and manual kills.
2. Confirm the Process Is Really Gone
2.1 Check Whether the Process Is Still Running
# Check if a process with PID 12345 exists
ps -p 12345
# Or use kill -0 (no signal)
kill -0 12345 # Returns "No such process" if it does not exist
# Search by name
ps aux | grep nginx
ps -ef | grep mysqld
2.2 Check for Zombie Processes
A zombie has exited but not been reaped by its parent, still holding a PID.
# Find zombie processes
ps aux | grep Z
# Show detailed state
ps -eo pid,ppid,state,comm | grep Z
Zombies do not consume memory but occupy PID space; they disappear when the parent calls wait() or is restarted.
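If zombies accumulate, the interesting process is the parent that failed to reap them. A small sketch for locating it (12345 stands in for a zombie PID found above):
# Print the parent PID of the zombie
ps -o ppid= -p 12345
# Show the parent process itself, which should be fixed or restarted
ps -p $(ps -o ppid= -p 12345) -o pid,user,comm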
2.3 Inspect the Process Exit Code
If the process has only just exited, or is lingering as a zombie, some of its state can still be read from /proc; otherwise the kernel ring buffer is the next place to look:
# Show exit state and memory usage shortly after termination
cat /proc/PID/status | grep -E "State|VmRSS"
# Look for OOM Killer records in the kernel ring buffer
dmesg | grep -i "out of memory"
dmesg | grep -i "killed process"
3. Investigating OOM Killer
3.1 How OOM Killer Works
When physical RAM and swap are exhausted, the kernel invokes OOM Killer. It calculates an oom_score for each process (adjustable via oom_score_adj) – higher scores increase the chance of being killed. The range of oom_score_adj is –1000 to +1000; –1000 makes a process immune, but using it on non‑critical services can jeopardize system stability.
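To see which processes the OOM Killer would pick first on the current system, the per-process scores under /proc can be ranked; a small sketch:
# List the ten processes with the highest oom_score (most likely OOM victims)
for p in /proc/[0-9]*; do
    printf "%s\t%s\t%s\n" "$(cat "$p/oom_score" 2>/dev/null)" "${p##*/}" "$(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head -10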
3.2 Find OOM Records in dmesg
# Search for OOM‑related messages
dmesg | grep -i "out of memory"
dmesg | grep -i "killed process"
dmesg | grep -i "oom"
# Show timestamps
dmesg -T | grep -i "oom"
Example output:
[Mon Apr 14 10:30:45 2026] Out of memory: Killed process 12345 (myapp) total-vm:2048000kB, anon-rss:1024000kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:2048kB oom_score_adj:0
3.3 Review Historical Memory Usage
# Quick memory overview
free -h
# Detailed memory statistics
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree|AnonPages|Shmem"
# Top memory‑hungry processes
ps aux --sort=-rss | head -20
If MemAvailable is near zero and SwapFree is also zero, the system truly ran out of memory. If MemAvailable is still sufficient, the OOM may have been triggered by a cgroup limit.
3.4 Correlate OOM Time with Process Disappearance
Match the timestamp of the OOM record with the moment the process vanished, then investigate why memory was exhausted (e.g., leak or traffic spike).
# Poll a specific PID's memory usage every few seconds
while true; do
ps -p 12345 -o pid,vsz,rss,pmem,comm
sleep 5
done
3.5 Adjust OOM Killer Policy
Check and Change oom_score_adj
# View current adjustment
cat /proc/PID/oom_score_adj
# Temporary change (lost after restart)
echo -1000 > /proc/PID/oom_score_adj
# Permanent change for a systemd service: use OOMScoreAdjust in the unit file
# (setting an environment variable has no effect on the kernel's OOM scoring)
[Service]
OOMScoreAdjust=-1000
Warning: Setting oom_score_adj to -1000 on a non-critical process can cause the whole system to hang if memory runs out.
4. Investigating cgroup Resource Limits
4.1 cgroup v2 Limits
# Show the cgroup the PID belongs to
cat /proc/PID/cgroup
# Show memory limits (cgroup v2): append the path reported above to /sys/fs/cgroup
cat /sys/fs/cgroup/<cgroup-path>/memory.max
cat /sys/fs/cgroup/<cgroup-path>/memory.current
# For cgroup v1
cat /sys/fs/cgroup/memory/<cgroup-path>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/<cgroup-path>/memory.usage_in_bytes
If memory.current is close to memory.max, the cgroup limit caused the termination.
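cgroup v2 also keeps a per-cgroup kill counter, which confirms a cgroup-level kill more directly than comparing usage to the limit. A sketch, assuming the service runs under a slice such as system.slice/myapp.service (substitute the path from /proc/PID/cgroup):
# A non-zero oom_kill counter means the kernel killed a task inside this cgroup
cat /sys/fs/cgroup/system.slice/myapp.service/memory.events
# Relevant line in the output: oom_kill N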
4.2 Docker Container Limits
# Inspect container memory settings
docker inspect container_name | grep -A 10 "Memory"
# Show the container's cgroup
docker exec container_name cat /proc/1/cgroup
# Check if the container's OOM Killer fired
docker logs container_name 2>&1 | grep -i oom
Container memory limits are enforced through cgroups: when a container exceeds its limit, the kernel OOM-kills processes inside that container's cgroup, regardless of how much free memory the host still has.
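Docker also records whether a container's last exit was an OOM kill; a quick check (container_name is a placeholder):
# Returns true if the container was killed for exceeding its memory limit
docker inspect container_name --format='{{.State.OOMKilled}}'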
4.3 Kubernetes Pod Limits
# Show pod resource limits
kubectl describe pod pod_name | grep -A 5 "Limits"
# Show node resource allocation
kubectl describe node node_name | grep -A 10 "Allocated resources"
If a pod's container exceeds its memory limit, it is OOM-killed inside its cgroup and typically exits with code 137 (128 + 9, i.e. SIGKILL).
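Kubernetes records the reason for the previous container termination in the pod status; a sketch using kubectl's jsonpath output (pod_name and the container index are placeholders):
# Prints "OOMKilled" if the previous container instance exceeded its memory limit
kubectl get pod pod_name -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Exit code of that instance (137 = SIGKILL)
kubectl get pod pod_name -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'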
5. Investigating Signal‑Based Termination
5.1 Examine Signals Received by the Process
# Show signal‑related fields from /proc
cat /proc/PID/status | grep -E "State|SigPnd|SigBlk|SigIgn|SigCgt"
# Show the number of queued signals against the per-process queue limit
cat /proc/PID/status | grep "SigQ:"
5.2 Common Signals
SIGKILL (9) : cannot be caught or ignored; often originates from manual kill -9, systemd timeout, or cgroup limit.
SIGTERM (15) : can be caught; used for graceful shutdowns (manual kill without -9, systemd stop, container stop).
SIGSEGV (11) : segmentation fault caused by invalid memory access; a code bug.
SIGABRT (6) : generated by abort(), usually from failed assertions.
SIGBUS (7) : bus error, often due to misaligned memory access.
SIGFPE (8) : floating‑point exception (divide‑by‑zero, overflow).
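When only an exit status is available (from a shell, systemd, or a container runtime), values above 128 encode the fatal signal as the status minus 128; a quick way to decode them:
# 137 - 128 = 9  -> prints KILL (SIGKILL)
kill -l $((137 - 128))
# 139 - 128 = 11 -> prints SEGV (segmentation fault)
kill -l $((139 - 128))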
5.3 Analyse Core Dumps
# Verify core dump is enabled
ulimit -c
# Enable unlimited core size
ulimit -c unlimited
# Persistently enable in /etc/profile
echo "ulimit -c unlimited" >> /etc/profileCore files are created as core.PID in the working directory unless /proc/sys/kernel/core_pattern is changed.
# Debug with gdb
gdb /path/to/binary /path/to/corefile
# Or use crash for kernel dumps
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /path/to/vmcore
6. Investigating systemd Service Termination
6.1 View systemd Logs
# Full service log (last 100 lines)
journalctl -u service_name -n 100
# Log from previous boot
journalctl -u service_name -b -1
# Logs from the last hour
journalctl -u service_name --since "1 hour ago"
# Search for OOM entries
journalctl -k | grep -i oom
6.2 Check Service Unit Configuration
[Service]
Type=simple
ExecStart=/usr/local/bin/myapp
TimeoutStartSec=60
TimeoutStopSec=30 # After 30 s without graceful exit, systemd sends SIGKILL
Restart=on-failure
RestartSec=5
MemoryMax=2G # Hard limit: processes in the unit's cgroup are OOM-killed when it is reached
MemoryHigh=1.8G # Soft limit: memory is aggressively reclaimed and the service throttled above this threshold
6.3 Extend Stop Timeout When Needed
[Service]
TimeoutStopSec=300 # Give the application 5 minutes to shut down cleanly
6.4 Disable OOM Killer for a Specific Service
[Service]
OOMScoreAdjust=-1000 # Makes the service immune to the host OOM Killer
Use with caution; if the system truly runs out of memory, the service will not be killed and the kernel may panic.
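Whatever the cause, systemd records how a unit's main process ended, and those runtime properties are often the fastest way to tell a timeout kill from an OOM kill; a sketch (service_name is a placeholder):
# "Result=oom-kill" typically indicates the OOM Killer; ExecMainCode/ExecMainStatus record how the main process ended
systemctl show service_name -p Result -p ExecMainCode -p ExecMainStatus
# Human-readable summary, including "code=killed, signal=KILL" style detail
systemctl status service_name --no-pager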
7. Investigating ulimit Resource Limits
7.1 Process Count Limits
# Show the current process limit for the shell
ulimit -u
# Show the maximum PID value
cat /proc/sys/kernel/pid_max
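# Count processes and threads currently owned by a user against the nproc limit
# (username is a placeholder; threads count toward the limit as well)
ps --no-headers -u username | wc -l
ps --no-headers -L -u username | wc -l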
# View per‑user limits from limits.conf
cat /etc/security/limits.conf
7.2 File Descriptor Limits
# Current FD limit
ulimit -n
# System‑wide limit
cat /proc/sys/fs/file-max
# Currently used descriptors
cat /proc/sys/fs/file-nr
When the limit is reached, new files or sockets cannot be opened, resulting in "Too many open files" errors.
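To check whether a specific process is approaching its own descriptor limit rather than the system-wide one, its /proc entries can be compared directly; a sketch (PID is a placeholder):
# Number of file descriptors the process currently holds
ls /proc/PID/fd | wc -l
# Its per-process limit
grep "open files" /proc/PID/limits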
7.3 Adjust Limits in limits.conf and systemd
# /etc/security/limits.conf
* soft nofile 1000000
* hard nofile 1000000
* soft nproc 65535
* hard nproc 65535
# systemd service unit file
[Service]
LimitNOFILE=1000000
LimitNPROC=65535
8. Investigating Manual kill Actions
8.1 Audit Logs
# Enable and start auditd
systemctl enable auditd
systemctl start auditd
# Search for kill commands
ausearch -k kill
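# Note: ausearch -k kill only returns results if a matching audit rule is loaded.
# A minimal sketch of such a rule (persist it under /etc/audit/rules.d/ to survive reboots):
auditctl -a always,exit -F arch=b64 -S kill -S tkill -S tgkill -k kill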
# Find kills targeting a specific PID
ausearch -p PID
8.2 Shell History
# User's bash history
cat /home/username/.bash_history
# Recent commands
history
8.3 Container‑Level Kills
# Docker exit code (137 = SIGKILL, 143 = SIGTERM)
docker inspect container_name --format='{{.State.ExitCode}}'
# Kubernetes pod termination reason
kubectl describe pod pod_name | grep -A 10 "Last State"
kubectl get events | grep pod_name
9. Memory‑Related Diagnostic Commands Summary
9.1 System‑Level Memory Analysis
# Overview
free -h
# Detailed info
cat /proc/meminfo
# Top memory consumers
ps aux --sort=-rss | head -15
9.2 Process‑Level Memory Analysis
# Detailed memory map
pmap -x PID
# RSS and VSZ sorted list
ps -eo pid,vsz,rss,pmem,comm --sort=-rss | head -15
# OOM scores
cat /proc/PID/oom_score
cat /proc/PID/oom_score_adj
9.3 Swap Usage
# Show swap usage
swapon -s
free -h
# Which processes use swap
smem -r
# Manual scan of /proc/*/status for VmSwap > 0
for f in /proc/[0-9]*/status; do awk -v f="$f" '/VmSwap/{if($2>0) print f": "$0}' "$f"; done
10. Preventive Measures
10.1 Deploy Memory Monitoring
# Prometheus alert example
- alert: HighMemoryUsage
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage detected"
10.2 Enforce Service Memory Limits
[Service]
MemoryMax=4G
MemoryHigh=3.5G
10.3 Reserve Memory for Critical System Services
# Guarantee a minimum amount of memory to system services via cgroup v2 (memory.low is a best-effort protection)
echo 1G > /sys/fs/cgroup/system.slice/memory.low
10.4 Add Swap Space
# Create an 8 GB swap file
fallocate -l 8G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Persist in /etc/fstab
/swapfile none swap sw 0 0
10.5 Application‑Level Optimizations
Detect and fix memory leaks (valgrind, AddressSanitizer).
Limit per‑request memory usage (e.g., file‑upload size).
Use connection pools to avoid creating a new DB connection per request.
Configure an appropriate number of workers and per‑worker request limits.
11. Typical Case Studies
Case 1 – Java Process Killed by OOM Killer
Symptoms: Java process disappears; dmesg shows OOM record.
# Verify OOM
dmesg | grep -i oom
# Inspect Java memory usage
ps -eo pid,vsz,rss,pmem,comm | grep java
# Check JVM heap settings
jcmd PID GC.heap_info
# Review system memory and off‑heap usage
cat /proc/meminfo | grep -E "MemTotal|MemAvailable|AnonPages"
Root Cause: JVM heap was set to 4 GB, but Metaspace, DirectByteBuffer, and native allocations added another ~2 GB. On an 8 GB host shared by multiple services, total usage exceeded available memory, triggering OOM.
Solution: Reduce JVM heap, increase physical memory, or apply a cgroup memory limit to the service.
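If off-heap usage is suspected again, the JVM can report its native allocations directly, provided Native Memory Tracking was enabled at startup; a sketch (PID is a placeholder and -XX:NativeMemoryTracking=summary must already be set on the JVM):
# Breakdown of heap, Metaspace, thread stacks, and other native allocations
jcmd PID VM.native_memory summary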
Case 2 – systemd Service Killed After Timeout
Symptoms: systemctl stop myservice hangs, then the service is force‑killed.
# View stop‑time logs
journalctl -u myservice -n 50
# Check TimeoutStopSec in the unit file
cat /etc/systemd/system/myservice.service | grep Timeout
Root Cause: The application needed more than the default 90 s to shut down (e.g., closing DB pools, flushing caches).
Solution: Increase TimeoutStopSec or improve the shutdown logic.
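One low-risk way to apply the larger timeout without editing the packaged unit file is a drop-in; a sketch assuming the service is named myservice:
# Create a drop-in that overrides only TimeoutStopSec
mkdir -p /etc/systemd/system/myservice.service.d
printf '[Service]\nTimeoutStopSec=300\n' > /etc/systemd/system/myservice.service.d/timeout.conf
systemctl daemon-reload
systemctl restart myservice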
Case 3 – Container Process Killed by cgroup Limit
Symptoms: Docker container exits with code 137.
# Inspect exit code
docker inspect container --format='{{.State.ExitCode}}'
# Check memory limit
docker inspect container --format='{{.HostConfig.Memory}}'
# Monitor container memory usage
docker stats container --no-stream
Root Cause: Container memory limit was 512 MB while the application required ~600 MB.
Solution: Raise the container's memory limit or optimise the application's memory consumption.
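If the limit simply needs to be raised on a running container, docker update can do it without recreating the container; a sketch (container and the 1g value are placeholders):
# Raise the memory limit (the swap ceiling must be >= the memory limit)
docker update --memory 1g --memory-swap 1g container
# Verify the new limit (reported in bytes)
docker inspect container --format='{{.HostConfig.Memory}}'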
12. Troubleshooting Flow Summary
When a process disappears, follow these steps in order; a condensed check script is sketched after the list.
Check dmesg for OOM entries.
Review journalctl logs if the process is managed by systemd.
Inspect the exit code (e.g., 137 = SIGKILL).
Search audit logs (ausearch) for manual kills.
Examine signal information in /proc/PID/status.
Analyse resource usage: memory, CPU, file descriptors, process count.
Verify network and storage dependencies; the process may exit voluntarily if a required service is unavailable.
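A condensed sketch that runs the first few of these checks in one pass (the process or service name is passed as an argument; the audit step assumes a rule with key "kill" is loaded):
#!/bin/bash
# First-pass triage for a vanished process; usage: ./why-gone.sh myapp
NAME=${1:-myapp}
echo "== OOM Killer entries =="
dmesg -T 2>/dev/null | grep -iE "out of memory|oom" | tail -5
echo "== systemd view of the service (if any) =="
systemctl status "$NAME" --no-pager 2>/dev/null | head -15
echo "== recent kill syscalls from the audit log =="
ausearch -k kill 2>/dev/null | tail -20
echo "== current memory headroom =="
free -h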
13. Final Thoughts
Process disappearance has no universal formula; it requires correlating signals, resource states, and logs. OOM Killer is the most frequent cause (visible in dmesg and exit code 137), followed by systemd timeouts and cgroup limits in containerised environments. Implementing proactive monitoring, sensible resource limits, and appropriate oom_score_adj settings dramatically reduces unexpected process loss.