Solve System Issues Fast with Linux Log Analysis
This guide walks Linux operators through the core log architecture, essential log files, powerful command‑line tools such as grep, awk, sed and journalctl, and step‑by‑step troubleshooting scenarios—including SSH connectivity, service failures, disk space, memory leaks, security incidents, and application logs—while providing ready‑to‑run scripts and advanced techniques for automated and centralized log analysis.
Introduction
System logs are the first‑hand evidence for operations engineers when a server behaves unexpectedly. This guide targets junior to mid‑level engineers, focusing on the most useful logs and analysis commands rather than exhaustive coverage.
1. Linux Log Basics
1.1 Log System Architecture
Linux logging consists of three components: the classic syslog daemon, the rsyslog service, and systemd-journald. CentOS 6 and earlier use syslog; CentOS 7+ and most modern distributions use rsyslog; recent releases (CentOS 8, Ubuntu 20.04+, Debian 10+) additionally provide systemd-journald with structured, indexed binary logs.
1.2 Important Log Files
/var/log/messages – main system log (kernel, services) on RHEL/CentOS.
/var/log/syslog – main system log on Debian/Ubuntu.
/var/log/dmesg – kernel ring buffer, useful for hardware and driver issues.
/var/log/secure – authentication and sudo events (RHEL/CentOS; Debian/Ubuntu use /var/log/auth.log).
/var/log/audit/audit.log – SELinux audit events (only when auditd is installed).
/var/log/yum.log – package manager actions.
/var/log/cron – crontab execution.
/var/log/maillog – mail server activity.
/var/log/httpd/ or /var/log/nginx/ – web server access and error logs.
/var/log/mysql/ or /var/log/mariadb/ – database logs.
/var/log/boot.log – boot-time service initialization.
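Because the paths split along distribution lines, a quick inventory of which files are actually present tells you which family of conventions the host follows. A minimal sketch (absences are normal, not errors):

```shell
# Quick inventory: which of the common log paths exist on this host?
# RHEL-family hosts typically have messages/secure; Debian-family
# hosts have syslog/auth.log instead.
report=$(mktemp)
for f in /var/log/messages /var/log/syslog /var/log/secure \
         /var/log/auth.log /var/log/dmesg /var/log/cron; do
    [ -e "$f" ] && echo "present: $f" || echo "absent:  $f"
done > "$report"
cat "$report"
```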
1.3 Basic journalctl Usage
journalctl reads the binary journal:
journalctl
Show the latest entries and follow them (like tail -f):
journalctl -f
Filter by time range:
# last 10 minutes
journalctl --since "10 minutes ago"
# specific timestamp
journalctl --since "2026-05-13 10:00:00"
# range
journalctl --since "2026-05-13 10:00:00" --until "2026-05-13 11:00:00"
# yesterday
journalctl --since yesterday --until today
Filter by service:
journalctl -u nginx.service
journalctl -u sshd.service
journalctl -u kubelet -u containerd
Show kernel messages (equivalent to dmesg):
journalctl -k
1.4 Log Levels
Standard syslog levels (0-7) range from emerg (system unusable) to debug. When troubleshooting, focus on err, warning, crit, alert, and emerg:
# show error and higher
journalctl -p err
# show warning and higher
journalctl -p warning
# exact priority
journalctl PRIORITY=3
2. Common Log‑Analysis Commands
2.1 The grep Family
Basic search:
# search for "error" in messages
grep "error" /var/log/messages
# OR search for multiple keywords
grep -E "error|warning|fail" /var/log/messages
# AND search (pipe)
grep "error" /var/log/messages | grep "mysql"
# exclude keywords
grep -v "debug" /var/log/messages
# show line numbers
grep -n "error" /var/log/messages
# show context
grep -C 5 "error" /var/log/messages
# count matches
grep -c "error" /var/log/messages
# case‑insensitive
grep -i "error" /var/log/messages
2.2 awk Basics
Extract columns (messages format is "time host service[PID]: message"):
# print fields 5‑10
awk '{print $5, $6, $7, $8, $9, $10}' /var/log/messages
# specific columns
awk '{print $1, $2, $5}' /var/log/messages
# conditional filtering
awk '$5 ~ /^sshd/' /var/log/messages   # field 5 is "service[PID]:", so match by prefix
awk '$5 ~ /nginx/' /var/log/messages
awk '/error/' /var/log/messages
# count occurrences per service
awk '{print $5}' /var/log/messages | sort | uniq -c | sort -rn
# count error lines
awk '/error/' /var/log/messages | wc -l
# hourly aggregation
awk '{print $3}' /var/log/messages | cut -d: -f1 | sort | uniq -c
2.3 sed Basics
Replace text:
# replace all occurrences
sed 's/error/ERROR/g' /var/log/messages
# replace only first per line
sed 's/error/ERROR/' /var/log/messages
# in‑place edit (use with caution)
sed -i 's/error/ERROR/g' /var/log/messages
# backup before edit
sed -i.bak 's/error/ERROR/g' /var/log/messages
# delete lines matching pattern
sed '/debug/d' /var/log/messages
sed '/^$/d' /var/log/messages
# print specific line numbers
sed -n '100p' /var/log/messages
sed -n '50,100p' /var/log/messages
sed -n '$!p' /var/log/messages   # everything except the last line
2.4 Command Composition
Pipe commands for complex analysis:
# count error occurrences per hour
grep "error" /var/log/messages | awk '{print $3}' | cut -d: -f1,2 | sort | uniq -c
# most active IPs from SSH failures
grep "Failed password" /var/log/secure | awk '{print $11}' | sort | uniq -c | sort -rn | head -20
# login counts per user
grep "Accepted password" /var/log/secure | awk '{print $9}' | sort | uniq -c | sort -rn
Example script for SSH brute-force analysis (analyze_ssh.sh): it prints failure statistics, top offending IPs, successful logins, and recent failures.
#!/bin/bash
# analyze_ssh.sh – SSH login failure statistics
echo "=== SSH login failure statistics ==="
grep "Failed password" /var/log/secure | wc -l
echo ""
echo "=== Top failing IPs ==="
grep "Failed password" /var/log/secure | awk '{print $11}' | sort | uniq -c | sort -rn | head -10
echo ""
echo "=== Successful login count ==="
grep "Accepted password" /var/log/secure | wc -l
echo ""
echo "=== Users with successful logins ==="
grep "Accepted password" /var/log/secure | awk '{print $9}' | sort | uniq -c | sort -rn | head -10
echo ""
echo "=== Last 10 failed logins ==="
grep "Failed password" /var/log/secure | tail -10 | awk '{print $1, $2, $3, $11, $13}'
3. Common Fault‑Scanning Scenarios
3.1 Scenario 1 – Server Cannot Be Remotely Connected
First try out-of-band access (VNC console, IPMI). If the server responds there but not over SSH, the problem is likely at the network or SSH layer rather than a full system failure.
Test connectivity:
# basic ping
ping -c 5 server_ip
# port test
nc -zv server_ip 22
telnet server_ip 22
Check SSH service status locally:
systemctl status sshd
If stopped, start and enable it:
systemctl start sshd
systemctl enable sshd
Validate SSH configuration:
# syntax check
sshd -t
# recent SSH logs
tail -50 /var/log/secure | grep sshd
journalctl -u sshd --since "30 minutes ago"
Verify port listening:
netstat -tlnp | grep 22
ss -tlnp | grep 22
Inspect firewall rules and SELinux:
# iptables
iptables -L -n | grep 22
# firewalld
firewall-cmd --list-all | grep ssh
# SELinux mode
getenforce
semanage port -l | grep ssh
3.2 Scenario 2 – Service Startup Failure (Nginx Example)
Check service status:
systemctl status nginx
Attempt manual start to see errors:
# stop systemd management
systemctl stop nginx
# syntax check
nginx -t
# manual start
nginx
Inspect logs:
# error log
tail -100 /var/log/nginx/error.log
# system journal
journalctl -u nginx --no-pager
# messages
grep nginx /var/log/messages | tail -50
Common causes:
Port already in use – check with netstat -tlnp | grep :80 or ss -tlnp | grep :80.
Permission problems – examine ls -la /etc/nginx/nginx.conf and SELinux context.
Missing dependencies – look for "undefined symbol" in the error log.
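These checks can be scripted. As a sketch, the port-conflict check below parses ss -tlnp style output to name the process already bound to port 80; a captured sample line stands in for live output so the field handling is visible (the "httpd" process is illustrative):

```shell
# Who owns port 80? Field 4 of `ss -tlnp` output is the local address,
# field 6 the owning process; extract the process name from it.
ss_output='LISTEN 0 511 0.0.0.0:80 0.0.0.0:* users:(("httpd",pid=1234,fd=6))'
owner=$(printf '%s\n' "$ss_output" \
        | awk '$4 ~ /:80$/ {print $6}' \
        | sed 's/.*(("\([^"]*\)".*/\1/')
echo "port 80 is held by: ${owner:-nobody}"
```

Against a live host, replace the sample variable with `ss_output=$(ss -tlnp)`.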
3.3 Scenario 3 – Disk Space Exhaustion
Show overall usage:
df -h
Find large files (>100 MiB):
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null
Identify biggest directories:
du -sh /*
du -sh /var/*
du -sh /home/*
Inspect log directories for oversized logs:
du -sh /var/log/*
find /var/log -type f -name "*.log" -exec ls -lh {} \; | sort -k5 -rh | head -20
Check for attacker-placed files in /tmp or /var/tmp:
ls -la /tmp/
ls -la /var/tmp/
find /tmp -type f -newer /tmp/.security -ls 2>/dev/null
Clean up:
# truncate logs (keep file, clear content)
> /var/log/messages
> /var/log/secure
# delete old compressed logs
find /var/log -name "*.gz" -mtime +30 -delete
# run logrotate
logrotate -f /etc/logrotate.conf
Prevent recurrence – configure logrotate for critical logs.
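As a sketch, a minimal rotation policy for a hypothetical application log might look like the following (the myapp path and retention values are illustrative; tune them to your volume):

```shell
# Write a sample logrotate policy; in production this file would live
# under /etc/logrotate.d/ instead of /tmp.
cat > /tmp/myapp.logrotate <<'EOF'
/var/log/myapp/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
EOF
# Dry-run validation before installing:  logrotate -d /tmp/myapp.logrotate
cat /tmp/myapp.logrotate
```

copytruncate keeps the file handle stable for daemons that never reopen their log, at the cost of possibly losing a few lines during truncation.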
cat /etc/logrotate.conf
cat /etc/logrotate.d/*
3.4 Scenario 4 – Memory Leak Detection
Watch memory trend:
watch -n 5 free -h
List top memory-hungry processes:
ps aux --sort=-%mem | head -20
Deep dive into a process:
pmap -x pid | sort -k3 -n -r | head -20
Java processes – generate heap dump and inspect:
# generate heap dump (requires stop or live option)
jmap -dump:format=b,file=heap.bin pid
jmap -heap pid
jmap -histo pid | head -30
Native processes – use valgrind (note the performance impact):
# install valgrind
yum install valgrind
# analyze process (replace command with the actual command to run)
valgrind --leak-check=full --log-file=/tmp/valgrind.log command
Typical leak causes:
C/C++ code missing free() calls.
Long-lived Java objects that are never released.
Oversized cache settings.
Unclosed connection pools.
Threads that never exit.
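Before reaching for heavier tools, it is often enough to sample the process's resident set size over time: a steadily growing RSS across samples is the classic leak signature. A self-contained sketch that samples the current shell (substitute the suspect PID in practice):

```shell
# Sample RSS (in kB) a few times; on a real investigation use longer
# intervals (minutes) and more samples to see the trend.
pid=$$   # this shell, only so the sketch runs standalone
samples=""
for i in 1 2 3; do
    rss=$(ps -o rss= -p "$pid" | tr -d ' ')
    samples="$samples $rss"
    sleep 1
done
echo "RSS samples (kB):$samples"
```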
4. Security Log Analysis
4.1 SSH Login Log Analysis
Successful logins:
# password auth
grep "Accepted password" /var/log/secure
# public‑key auth
grep "Accepted publickey" /var/log/secure
# recent logins
last
lastlog
# source IP distribution
grep "Accepted password" /var/log/secure | awk '{print $11}' | sort | uniq -c | sort -rn | head -10
Failed logins:
# all failures
grep "Failed password" /var/log/secure
# count failures
grep "Failed password" /var/log/secure | wc -l
# top offending IPs
grep "Failed password" /var/log/secure | awk '{print $11}' | sort | uniq -c | sort -rn | head -10
# top usernames
grep "Failed password" /var/log/secure | awk '{print $9}' | sort | uniq -c | sort -rn | head -10
Brute-force patterns:
# many failures from same IP
grep "Failed password" /var/log/secure | awk '{print $11}' | sort | uniq -c | sort -rn | head -10
# many failures for same user
grep "Failed password" /var/log/secure | awk '{print $9}' | sort | uniq -c | sort -rn | head -10
4.2 Automatic Protection with fail2ban
Installation and basic jail configuration (example for SSH and Nginx):
# install fail2ban
yum install fail2ban -y
# create local configuration
cat > /etc/fail2ban/jail.local <<'EOF'
[DEFAULT]
bantime = 3600
findtime = 600
maxretry = 5
[sshd]
enabled = true
port = ssh
logpath = /var/log/secure
maxretry = 3
[nginx-http-auth]
enabled = true
port = http,https
logpath = /var/log/nginx/error.log
maxretry = 5
EOF
# enable and start service
systemctl enable fail2ban
systemctl start fail2ban
Common commands:
# status
fail2ban-client status
# jail status
fail2ban-client status sshd
# manually ban/unban IP
fail2ban-client set sshd banip 1.2.3.4
fail2ban-client set sshd unbanip 1.2.3.4
# view blocked IPs
iptables -L -n | grep fail2ban
4.3 sudo Usage Log
# sudo events
grep sudo /var/log/secure
# count per command
grep sudo /var/log/secure | awk -F: '{print $NF}' | sort | uniq -c | sort -rn
4.4 SELinux Audit Log
# AVC denials
ausearch -m avc -ts recent
# filter by service (e.g., nginx)
ausearch -m avc -se nginx
# translate to readable rules
ausearch -m avc --raw | audit2allow
Check SELinux mode and switch temporarily:
getenforce
sestatus
setenforce 0 # permissive
setenforce 1 # enforcing
5. Application Log Analysis
5.1 Nginx Logs
Error log inspection:
# recent errors
tail -100 /var/log/nginx/error.log
# specific error patterns
grep "connect() failed" /var/log/nginx/error.log
grep "upstream timed out" /var/log/nginx/error.log
grep "no live upstreams" /var/log/nginx/error.log
# daily error trend example
grep "2026/05/13" /var/log/nginx/error.log | awk '{print $NF}' | sort | uniq -c | sort -rn
Access log statistics:
# HTTP status distribution
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn
# top client IPs
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
# most requested URLs
awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
# average response time (assuming $NF holds time in ms)
awk -F'"' '{print $NF}' /var/log/nginx/access.log | awk '{sum+=$1; count++} END {print "Average response time:", sum/count "ms"}'
# slowest requests
awk -F'"' '{print $NF, $7}' /var/log/nginx/access.log | sort -rn | head -20
5.2 MySQL Logs
Error log:
tail -100 /var/log/mysql/error.log
grep -E "ERROR|warning" /var/log/mysql/error.log
Slow-query log (ensure it is enabled):
# check variables
SHOW VARIABLES LIKE 'slow_query%';
SHOW VARIABLES LIKE 'long_query_time';
# analyze with mysqldumpslow
mysqldumpslow /var/log/mysql/slow-query.log
mysqldumpslow -s t -t 10 /var/log/mysql/slow-query.log # top 10 slowest
mysqldumpslow -s c -t 10 /var/log/mysql/slow-query.log # most frequent
Binary log inspection:
# list binlogs
mysql -u root -p -e "SHOW BINARY LOGS;"
# current position
mysql -u root -p -e "SHOW MASTER STATUS;"
# view contents
mysqlbinlog /var/lib/mysql/mysql-bin.000001 | head -100
5.3 Docker Container Logs
# docker logs (tail & follow)
docker logs container_id --tail 100 -f
# journalctl view for containerd
journalctl CONTAINER_NAME=container_name --no-pager
# search errors
journalctl CONTAINER_NAME=container_name | grep -i error
# crictl for containerd
crictl logs container_id
6. Advanced Log‑Analysis Techniques
6.1 Writing Analysis Scripts
A practical Bash script (analyze_system.sh) demonstrates automated collection of system errors, SSH login statistics, disk usage, memory/CPU status, and failed services. The script creates a timestamped output directory and writes human-readable reports.
#!/bin/bash
# analyze_system.sh – system log analysis script
LOG_FILE="/var/log/messages"
SECURE_LOG="/var/log/secure"
OUTPUT_DIR="/tmp/log_analysis_$(date +%Y%m%d_%H%M%S)"
mkdir -p $OUTPUT_DIR
echo "=== System errors and warnings ===" > $OUTPUT_DIR/errors.txt
grep -E "error|warning|critical|alert|emerg" $LOG_FILE >> $OUTPUT_DIR/errors.txt
echo "=== SSH login analysis ===" > $OUTPUT_DIR/ssh_analysis.txt
echo "Successful logins: $(grep 'Accepted' $SECURE_LOG | wc -l)" >> $OUTPUT_DIR/ssh_analysis.txt
echo "Failed logins: $(grep 'Failed' $SECURE_LOG | wc -l)" >> $OUTPUT_DIR/ssh_analysis.txt
echo "Top 10 failing IPs:" >> $OUTPUT_DIR/ssh_analysis.txt
grep 'Failed' $SECURE_LOG | awk '{print $11}' | sort | uniq -c | sort -rn | head -10 >> $OUTPUT_DIR/ssh_analysis.txt
echo "=== Disk usage ===" > $OUTPUT_DIR/disk_usage.txt
df -h >> $OUTPUT_DIR/disk_usage.txt
echo "Large directories (>1G):" >> $OUTPUT_DIR/disk_usage.txt
du -sh /var/* 2>/dev/null | sort -rh | awk '$1 ~ /G/ {print}' >> $OUTPUT_DIR/disk_usage.txt
echo "=== System resources ===" > $OUTPUT_DIR/resources.txt
free -h >> $OUTPUT_DIR/resources.txt
uptime >> $OUTPUT_DIR/resources.txt
echo "=== Failed services ===" > $OUTPUT_DIR/services.txt
systemctl list-units --type=service --state=failed --no-pager >> $OUTPUT_DIR/services.txt
echo "Analysis complete. Results in $OUTPUT_DIR" && ls -la $OUTPUT_DIR
6.2 Automated Analysis with logwatch
# install
yum install logwatch -y
# run manually and email report
logwatch --output mail --mailto [email protected] --detail high
# output to file
logwatch --output file --filename /tmp/logwatch.txt --detail high
# focus on a single service
logwatch --service sshd --detail high
# schedule daily run via cron
0 8 * * * /usr/sbin/logwatch --output mail --mailto [email protected]
6.3 Centralized Log Management
Remote rsyslog collection:
# client /etc/rsyslog.conf
*.* @@remote-server:514
# server side
module(load="imtcp")
input(type="imtcp" port="514")
template(name="RemoteLogs" type="string" string="/var/log/remote/%HOSTNAME%/%PROGRAMNAME%.log")
*.* ?RemoteLogs
The ELK stack (Filebeat → Logstash → Elasticsearch → Kibana) provides scalable search and visualization on top of centralized collection.
6.4 Real‑Time Monitoring and Alerting
Using inotifywait to watch /var/log/secure for new SSH failures and send an email alert:
# install inotify-tools
yum install inotify-tools -y
# monitor loop
inotifywait -m -e modify /var/log/secure | while read path action file; do
if grep -q "Failed password" "$path$file"; then
echo "Detected SSH failure: $(tail -1 $path$file)" | mail -s "SSH login alert" [email protected]
fi
done
7. Common Log Pattern Identification
7.1 Detecting OOM Killer Events
# dmesg search
dmesg | grep -i "out of memory"
dmesg | grep -i "killed process"
# messages file
grep -i "oom" /var/log/messages
# example line
[Mon May 13 10:00:00 2024] Out of memory: Kill process 12345 (java) score 900 or sacrifice child
[Mon May 13 10:00:00 2024] Killed process 12345 (java) total-vm: 8000000kB, anon-rss: 7500000kB, file-rss: 0kB
Follow-up analysis: identify the killed process, check memory usage trends with free -h and vmstat, and adjust service memory limits.
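The kernel's OOM message format is stable enough to parse mechanically. A sketch that extracts the victim PID, name, and OOM score; the sample line stands in for live dmesg output:

```shell
# Pull pid, process name, and oom score out of a kernel OOM line.
oom_line='[Mon May 13 10:00:00 2024] Out of memory: Kill process 12345 (java) score 900 or sacrifice child'
victim=$(printf '%s\n' "$oom_line" \
         | sed -n 's/.*Kill process \([0-9]*\) (\([^)]*\)) score \([0-9]*\).*/pid=\1 name=\2 score=\3/p')
echo "$victim"
```

Against a live system, feed it with `dmesg | grep -i "kill process"` instead of the sample variable.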
7.2 Identifying Disk I/O Problems
grep -i "io timeout" /var/log/messages
grep -i "ext4" /var/log/messages | grep -i error
dmesg | grep -i "sd[a-z]" | grep -i error
dmesg | grep -i "ata" | grep -i error
7.3 Recognizing Network Issues
# retransmits and failures
netstat -s | grep -i retransmit
netstat -s | grep -i failed
# NIC status
dmesg | grep -i eth0
ip -s link show eth0
# packet loss
netstat -i
ip -s link show
7.4 Spotting Service Crashes
# core dumps
ls -la /var/crash/
find /var/crash -name "core.*" -ls
# segfaults
grep -i "segfault" /var/log/messages
dmesg | grep -i segfault
# ABRT reports
ls -la /var/spool/abrt/
8. Real‑World Cases
8.1 Case – Frequent Reboots Caused by OOM Killer
Background: A server rebooted 2‑3 times daily, disrupting services.
Investigation Steps:
Check reboot timestamps:
last reboot
who -b
last | head -20
Inspect /var/log/messages around each reboot – numerous OOM Killer entries were found.
Confirm with dmesg:
dmesg | grep -i oom
dmesg | grep -i kill
Identify the killed process – MySQL (mysqld).
Analyze memory usage:
free -h
ps aux --sort=-%mem | head -10
Root cause: MySQL innodb_buffer_pool_size was set to 16 GB on a 32 GB machine, leaving insufficient memory for other services.
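Before picking a new value, sanity-check it against the machine's actual memory. The 25% figure below is a conservative starting point for a host that shares memory with other services, not a universal rule (dedicated DB hosts commonly go to 50-70%):

```shell
# Derive a conservative buffer pool size from /proc/meminfo.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
pool_kb=$((total_kb / 4))   # 25% of RAM as a shared-host starting point
echo "MemTotal: ${total_kb} kB -> suggested buffer pool: ${pool_kb} kB"
```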
Resolution:
Reduce buffer pool size to 8 GB (runtime and permanent config).
SET GLOBAL innodb_buffer_pool_size = 8589934592; # 8 GB
# permanent change
innodb_buffer_pool_size = 8G
Restart MySQL and monitor memory.
systemctl restart mariadb
watch -n 5 free -h
Takeaway: OOM Killer logs pinpoint the offending process; adjusting memory limits stabilizes the system.
8.2 Case – Detecting an Intrusion via Log Analysis
Background: A security scan indicated a possible compromise.
Investigation Steps:
Search SSH failure attempts:
grep "Failed password" /var/log/secure | tail -100
grep "Failed password" /var/log/secure | awk '{print $11}' | sort | uniq -c | sort -rn | head -20
Found an IP with thousands of failures – typical brute-force activity.
Search successful logins for unknown users:
grep "Accepted password" /var/log/secure | awk '{print $9, $11, $12}' | sort | uniq -c | sort -rn
Discovered a login from a user not present in /etc/passwd.
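This check can be automated: cross-reference every login name against the account database and flag anything unresolvable. A sketch with an illustrative name list (in practice, feed it from the grep/awk pipeline above); no_such_user_xyz is a deliberately bogus placeholder:

```shell
# Flag login names that getent cannot resolve against /etc/passwd (or
# whatever NSS sources the host uses).
result=$(for user in root bin no_such_user_xyz; do
    if getent passwd "$user" > /dev/null 2>&1; then
        echo "known:   $user"
    else
        echo "UNKNOWN: $user"
    fi
done)
echo "$result"
```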
Checked recent command history and suspicious files in /tmp and /var/tmp.
# list setuid binaries
find / -type f -perm -4000 -ls 2>/dev/null
# list /tmp contents
ls -la /tmp/
# find new files in /tmp
find /tmp -type f -newer /tmp/.security -ls 2>/dev/null
Looked for newly added setuid binaries and hidden files using the same commands.
Root Cause: An attacker succeeded in SSH brute-force, obtained a low-privilege account, and attempted privilege escalation.
Remediation:
Immediately block the attacker IP:
iptables -I INPUT -s attacker_ip -j DROP
Audit for malicious processes and files:
ps aux | grep suspicious
lsof | grep suspicious
Back up data and reinstall the OS to ensure a clean state.
Harden SSH – disable password auth, enforce key‑based login, disable root login.
# /etc/ssh/sshd_config
PasswordAuthentication no
PermitRootLogin no
systemctl restart sshd
Deploy fail2ban to automatically ban repeated failures.
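It is also worth verifying that the hardened directives actually took effect. A sketch that audits a generated sample file so it runs anywhere; in practice point it at /etc/ssh/sshd_config, or query the running daemon with `sshd -T` (requires root):

```shell
# Audit the two critical sshd directives in a config file.
cfg=$(mktemp)
printf '%s\n' 'PasswordAuthentication no' 'PermitRootLogin no' 'Port 22' > "$cfg"
for directive in PasswordAuthentication PermitRootLogin; do
    value=$(awk -v d="$directive" '$1 == d {print $2}' "$cfg")
    echo "$directive = ${value:-unset}"
done
```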
Lesson: Regular log review, combined with tools like fail2ban, can quickly surface brute‑force attempts and limit exposure.
9. Conclusion
Log analysis is a foundational skill for operations engineers. Mastering log locations, command‑line tools, systematic troubleshooting steps, and automation (scripts, logwatch, centralized logging) enables rapid diagnosis of system, service, and security issues.
Know where each log lives and what it records.
Use grep, awk, sed, and journalctl to filter, extract, and aggregate data.
Approach problems from time, keyword, frequency, and trend perspectives.
Close the loop: detect → isolate → fix → verify → document.
Consistent practice turns log analysis from a reactive chore into a proactive, data‑driven operation.
MaGe Linux Operations