
Solve System Issues Fast with Linux Log Analysis

This guide walks Linux operators through the core logging architecture, the essential log files, command-line tools such as grep, awk, sed, and journalctl, and step-by-step troubleshooting scenarios covering SSH connectivity, service failures, disk space, memory leaks, security incidents, and application logs, with ready-to-run scripts and advanced techniques for automated and centralized log analysis.


Introduction

System logs are the first‑hand evidence for operations engineers when a server behaves unexpectedly. This guide targets junior to mid‑level engineers, focusing on the most useful logs and analysis commands rather than exhaustive coverage.

1. Linux Log Basics

1.1 Log System Architecture

Linux logging has evolved through three components: the classic syslog daemon, the rsyslog service, and systemd-journald. CentOS 6 and earlier use syslog; CentOS 7+ and most modern distributions use rsyslog; recent releases (CentOS 8, Ubuntu 20.04+, Debian 10+) additionally provide systemd-journald with structured, indexed binary logs.
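
A quick way to see which of these components a given host actually runs is to query systemd. A minimal sketch, assuming systemctl is available (service names vary slightly across distributions):

```shell
# Check which logging daemons are present/active on this host
# (a sketch; service names vary slightly across distributions).
status=""
for svc in rsyslog syslog-ng systemd-journald; do
    if systemctl is-active --quiet "$svc" 2>/dev/null; then
        state="active"
    else
        state="inactive-or-unknown"
    fi
    echo "$svc: $state"
    status="$status $svc=$state"
done
```

On a typical CentOS 8 host this reports both rsyslog and systemd-journald active, since journald forwards to rsyslog by default.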

1.2 Important Log Files

/var/log/messages – main system log (kernel, services) on RHEL/CentOS
/var/log/syslog – main system log on Debian/Ubuntu
/var/log/dmesg – kernel ring buffer, useful for hardware and driver issues
/var/log/secure – authentication and sudo events
/var/log/audit/audit.log – SELinux audit events (only when auditd is installed)
/var/log/yum.log – package manager actions
/var/log/cron – crontab execution
/var/log/maillog – mail server activity
/var/log/httpd/ or /var/log/nginx/ – web server access and error logs
/var/log/mysql/ or /var/log/mariadb/ – database logs
/var/log/boot.log – boot-time service initialization
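
Which of these files exist depends on the distribution. A small sketch that inventories the common ones (the file list is illustrative, not exhaustive):

```shell
# Print size and mtime of common log files that exist on this host;
# paths differ between RHEL/CentOS and Debian/Ubuntu.
count=0
for f in /var/log/messages /var/log/syslog /var/log/secure \
         /var/log/auth.log /var/log/cron /var/log/boot.log; do
    if [ -f "$f" ]; then
        ls -lh "$f"
        count=$((count + 1))
    else
        echo "$f: not present"
    fi
done
echo "$count of 6 present"
```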

1.3 Basic journalctl Usage

journalctl reads the binary journal:

journalctl

Show the latest entries and follow them (like tail -f):

journalctl -f

Filter by time range:

# last 10 minutes
journalctl --since "10 minutes ago"
# specific timestamp
journalctl --since "2026-05-13 10:00:00"
# range
journalctl --since "2026-05-13 10:00:00" --until "2026-05-13 11:00:00"
# yesterday
journalctl --since yesterday --until today

Filter by service:

journalctl -u nginx.service
journalctl -u sshd.service
journalctl -u kubelet -u containerd

Show kernel messages (equivalent to dmesg):

journalctl -k

1.4 Log Levels

Standard syslog levels (0–7) range from emerg (system unusable) to debug. When troubleshooting, focus on error, warning, crit, alert, and emerg:

# show error and higher
journalctl -p err
# show warning and higher
journalctl -p warning
# exact priority
journalctl PRIORITY=3
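
As a memory aid, the numeric priorities map to names as follows; journalctl -p accepts either form, so `-p err` and `PRIORITY=3` describe the same level:

```shell
# The eight syslog priorities; lowest number = most severe.
# journalctl -p accepts either the name or the number.
levels="emerg alert crit err warning notice info debug"
i=0
for name in $levels; do
    echo "$i $name"
    i=$((i + 1))
done
```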

2. Common Log‑Analysis Commands

2.1 The grep Family

Basic search:

# search for "error" in messages
grep "error" /var/log/messages
# OR search for multiple keywords
grep -E "error|warning|fail" /var/log/messages
# AND search (pipe)
grep "error" /var/log/messages | grep "mysql"
# exclude keywords
grep -v "debug" /var/log/messages
# show line numbers
grep -n "error" /var/log/messages
# show context
grep -C 5 "error" /var/log/messages
# count matches
grep -c "error" /var/log/messages
# case‑insensitive
grep -i "error" /var/log/messages

2.2 awk Basics

Extract columns (messages format is "time host service[PID]: message"):

# print fields 5‑10
awk '{print $5, $6, $7, $8, $9, $10}' /var/log/messages
# specific columns
awk '{print $1, $2, $5}' /var/log/messages
# conditional filtering ($5 is "service[PID]:", so prefer a pattern match)
awk '$5 ~ /sshd/' /var/log/messages
awk '$5 ~ /nginx/' /var/log/messages
awk '/error/' /var/log/messages
# count occurrences per service (strip the [PID]: suffix first)
awk '{print $5}' /var/log/messages | sed 's/\[[0-9]*\].*$//' | sort | uniq -c | sort -rn
# count error lines
awk '/error/' /var/log/messages | wc -l
# hourly aggregation
awk '{print $3}' /var/log/messages | cut -d: -f1 | sort | uniq -c

2.3 sed Basics

Replace text:

# replace all occurrences
sed 's/error/ERROR/g' /var/log/messages
# replace only first per line
sed 's/error/ERROR/' /var/log/messages
# in‑place edit (use with caution)
sed -i 's/error/ERROR/g' /var/log/messages
# backup before edit
sed -i.bak 's/error/ERROR/g' /var/log/messages
# delete lines matching pattern
sed '/debug/d' /var/log/messages
sed '/^$/d' /var/log/messages
# print specific lines
sed -n '100p' /var/log/messages
sed -n '50,100p' /var/log/messages
# print the last line
sed -n '$p' /var/log/messages

2.4 Command Composition

Pipe commands for complex analysis:

# count error occurrences per hour
grep "error" /var/log/messages | awk '{print $3}' | cut -d: -f1,2 | sort | uniq -c
# most active IPs from SSH failures
grep "Failed password" /var/log/secure | awk '{print $11}' | sort | uniq -c | sort -rn | head -20
# login counts per user
grep "Accepted password" /var/log/secure | awk '{print $9}' | sort | uniq -c | sort -rn

Example script for SSH brute-force analysis (analyze_ssh.sh) prints failure statistics, top offending IPs, successful logins, and recent failures.

#!/bin/bash
# analyze_ssh.sh – SSH login failure statistics

echo "=== SSH login failure statistics ==="
grep "Failed password" /var/log/secure | wc -l

echo ""
echo "=== Top failing IPs ==="
grep "Failed password" /var/log/secure | awk '{print $11}' | sort | uniq -c | sort -rn | head -10

echo ""
echo "=== Successful login count ==="
grep "Accepted password" /var/log/secure | wc -l

echo ""
echo "=== Users with successful logins ==="
grep "Accepted password" /var/log/secure | awk '{print $9}' | sort | uniq -c | sort -rn | head -10

echo ""
echo "=== Last 10 failed logins ==="
grep "Failed password" /var/log/secure | tail -10 | awk '{print $1, $2, $3, $11, $13}'

3. Common Fault‑Scanning Scenarios

3.1 Scenario 1 – Server Cannot Be Remotely Connected

Try alternative access (VNC, IPMI). If even out-of-band access is unavailable, the issue is likely network-level.

Test connectivity:

# basic ping
ping -c 5 server_ip
# port test
nc -zv server_ip 22
telnet server_ip 22

Check SSH service status locally:

systemctl status sshd

If stopped, start and enable it:

systemctl start sshd
systemctl enable sshd

Validate SSH configuration:

# syntax check
sshd -t
# recent SSH logs
tail -50 /var/log/secure | grep sshd
journalctl -u sshd --since "30 minutes ago"

Verify port listening:

netstat -tlnp | grep 22
ss -tlnp | grep 22

Inspect firewall rules and SELinux:

# iptables
iptables -L -n | grep 22
# firewalld
firewall-cmd --list-all | grep ssh
# SELinux mode
getenforce
semanage port -l | grep ssh

3.2 Scenario 2 – Service Startup Failure (Nginx Example)

Check the service status:

systemctl status nginx

Attempt a manual start to see the error output:

# stop systemd management
systemctl stop nginx
# syntax check
nginx -t
# manual start
nginx

Inspect logs:

# error log
tail -100 /var/log/nginx/error.log
# system journal
journalctl -u nginx --no-pager
# messages
grep nginx /var/log/messages | tail -50

Common causes:

Port already in use – check with netstat -tlnp | grep :80 or ss -tlnp | grep :80.

Permission problems – examine ls -la /etc/nginx/nginx.conf and SELinux context.

Missing dependencies – look for "undefined symbol" in the error log.
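
The first two causes can be triaged in one pass. A rough sketch (PORT and CONF are assumptions; adjust for your setup):

```shell
# Rough triage of common nginx start failures: port conflict,
# config permissions, config syntax. PORT/CONF are assumptions.
PORT=80
CONF=/etc/nginx/nginx.conf

echo "--- processes already bound to port $PORT ---"
ss -tlnp 2>/dev/null | grep ":$PORT " || echo "nothing found on port $PORT"

echo "--- config file permissions ---"
if [ -f "$CONF" ]; then ls -l "$CONF"; else echo "$CONF not found"; fi

echo "--- config syntax ---"
if command -v nginx >/dev/null 2>&1; then
    nginx -t 2>&1 || true
else
    echo "nginx binary not found"
fi
```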

3.3 Scenario 3 – Disk Space Exhaustion

Show overall usage:

df -h

Find large files (>100 MiB):

find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null

Identify the biggest directories:

du -sh /*
du -sh /var/*
du -sh /home/*

Inspect log directories for oversized logs:

du -sh /var/log/*
find /var/log -type f -name "*.log" -exec ls -lh {} \; | sort -k5 -rh | head -20

Check for attacker‑placed files in /tmp or /var/tmp:

ls -la /tmp/
ls -la /var/tmp/
find /tmp -type f -newer /tmp/.security -ls 2>/dev/null

Clean up:

# truncate logs (keep file, clear content)
> /var/log/messages
> /var/log/secure
# delete old compressed logs
find /var/log -name "*.gz" -mtime +30 -delete
# run logrotate
logrotate -f /etc/logrotate.conf

Prevent recurrence – configure logrotate for critical logs.

cat /etc/logrotate.conf
cat /etc/logrotate.d/*
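
A minimal per-application rotation policy might look like the sketch below; /var/log/myapp/*.log is a hypothetical path, and the file is written to /tmp here so it can be reviewed (and dry-run with -d) before being moved into /etc/logrotate.d/:

```shell
# Sketch of a per-application logrotate policy; /var/log/myapp/*.log
# is a hypothetical path used for illustration only.
cat > /tmp/myapp.logrotate <<'EOF'
/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
EOF
# -d performs a dry run: nothing is rotated or deleted
if command -v logrotate >/dev/null 2>&1; then
    logrotate -d /tmp/myapp.logrotate 2>&1 | head -5
else
    echo "logrotate not installed"
fi
```

copytruncate avoids restarting the application after rotation, at the cost of possibly losing a few lines written during the copy.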

3.4 Scenario 4 – Memory Leak Detection

Watch the memory trend:

watch -n 5 free -h

List the top memory-hungry processes:

ps aux --sort=-%mem | head -20

Deep dive into a single process:

pmap -x pid | sort -k3 -n -r | head -20

Java processes – generate a heap dump and inspect:

# generate heap dump (requires stop or live option)
jmap -dump:format=b,file=heap.bin pid
jmap -heap pid
jmap -histo pid | head -30

Native processes – use valgrind (note performance impact):

# install valgrind
yum install valgrind
# analyze process (replace command with the actual command to run)
valgrind --leak-check=full --log-file=/tmp/valgrind.log command

Typical leak causes:

C/C++ code missing free

Long-lived Java objects

Oversized cache settings

Unclosed connection pools

Non-exiting threads
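
A low-tech way to confirm a suspected leak is to sample a process's resident set size over time and watch for monotonic growth. A sketch (PID, sample count, and interval are illustrative; $$ — this shell — is used only so the example runs anywhere):

```shell
# Sample a process's RSS (KiB) a few times; steady growth across
# samples suggests a leak. PID/SAMPLES values are illustrative.
PID=$$
SAMPLES=3
i=0
while [ "$i" -lt "$SAMPLES" ]; do
    rss=$(ps -o rss= -p "$PID")
    echo "sample $i: rss=${rss} KiB"
    i=$((i + 1))
    sleep 1
done
```

In practice you would sample every few minutes over hours, not seconds, since many runtimes grow briefly before stabilizing.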

4. Security Log Analysis

4.1 SSH Login Log Analysis

Successful logins:

# password auth
grep "Accepted password" /var/log/secure
# public‑key auth
grep "Accepted publickey" /var/log/secure
# recent logins
last
lastlog
# source IP distribution
grep "Accepted password" /var/log/secure | awk '{print $11}' | sort | uniq -c | sort -rn | head -10

Failed logins:

# all failures
grep "Failed password" /var/log/secure
# count failures
grep "Failed password" /var/log/secure | wc -l
# top offending IPs
grep "Failed password" /var/log/secure | awk '{print $11}' | sort | uniq -c | sort -rn | head -10
# top usernames
grep "Failed password" /var/log/secure | awk '{print $9}' | sort | uniq -c | sort -rn | head -10

Brute‑force patterns:

# many failures from same IP
grep "Failed password" /var/log/secure | awk '{print $11}' | sort | uniq -c | sort -rn | head -10
# many failures for same user
grep "Failed password" /var/log/secure | awk '{print $9}' | sort | uniq -c | sort -rn | head -10

4.2 Automatic Protection with fail2ban

Installation and basic jail configuration (example for SSH and Nginx):

# install fail2ban
yum install fail2ban -y

# create local configuration
cat > /etc/fail2ban/jail.local <<'EOF'
[DEFAULT]
bantime = 3600
findtime = 600
maxretry = 5

[sshd]
enabled = true
port = ssh
logpath = /var/log/secure
maxretry = 3

[nginx-http-auth]
enabled = true
port = http,https
logpath = /var/log/nginx/error.log
maxretry = 5
EOF

# enable and start service
systemctl enable fail2ban
systemctl start fail2ban

Common commands:

# status
fail2ban-client status
# jail status
fail2ban-client status sshd
# manually ban/unban IP
fail2ban-client set sshd banip 1.2.3.4
fail2ban-client set sshd unbanip 1.2.3.4
# view blocked IPs
iptables -L -n | grep fail2ban

4.3 sudo Usage Log

# sudo events
grep sudo /var/log/secure
# count per command
grep sudo /var/log/secure | awk -F: '{print $NF}' | sort | uniq -c | sort -rn

4.4 SELinux Audit Log

# AVC denials
ausearch -m avc -ts recent
# filter by service (e.g., nginx)
ausearch -m avc -se nginx
# translate to readable rules
ausearch -m avc --raw | audit2allow

Check SELinux mode and switch temporarily:

getenforce
sestatus
setenforce 0   # permissive
setenforce 1   # enforcing

5. Application Log Analysis

5.1 Nginx Logs

Error log inspection:

# recent errors
tail -100 /var/log/nginx/error.log
# specific error patterns
grep "connect() failed" /var/log/nginx/error.log
grep "upstream timed out" /var/log/nginx/error.log
grep "no live upstreams" /var/log/nginx/error.log
# daily error trend example
grep "2026/05/13" /var/log/nginx/error.log | awk '{print $NF}' | sort | uniq -c | sort -rn

Access log statistics:

# HTTP status distribution
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn
# top client IPs
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
# most requested URLs
awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
# average response time (assuming $NF holds time in ms)
awk -F'"' '{print $NF}' /var/log/nginx/access.log | awk '{sum+=$1; count++} END {print "Average response time:", sum/count "ms"}'
# slowest requests
awk -F'"' '{print $NF, $7}' /var/log/nginx/access.log | sort -rn | head -20

5.2 MySQL Logs

Error log:

tail -100 /var/log/mysql/error.log
grep -E "ERROR|warning" /var/log/mysql/error.log

Slow‑query log (ensure it is enabled):

# check variables
SHOW VARIABLES LIKE 'slow_query%';
SHOW VARIABLES LIKE 'long_query_time';
# analyze with mysqldumpslow
mysqldumpslow /var/log/mysql/slow-query.log
mysqldumpslow -s t -t 10 /var/log/mysql/slow-query.log   # top 10 slowest
mysqldumpslow -s c -t 10 /var/log/mysql/slow-query.log   # most frequent

Binary log inspection:

# list binlogs
mysql -u root -p -e "SHOW BINARY LOGS;"
# current position
mysql -u root -p -e "SHOW MASTER STATUS;"
# view contents
mysqlbinlog /var/lib/mysql/mysql-bin.000001 | head -100

5.3 Docker Container Logs

# docker logs (tail & follow)
docker logs container_id --tail 100 -f
# journalctl view (requires Docker's journald logging driver)
journalctl CONTAINER_NAME=container_name --no-pager
# search errors
journalctl CONTAINER_NAME=container_name | grep -i error
# crictl for containerd
crictl logs container_id

6. Advanced Log‑Analysis Techniques

6.1 Writing Analysis Scripts

A practical Bash script (analyze_system.sh) demonstrates automated collection of system errors, SSH login statistics, disk usage, memory/CPU status, and failed services. The script creates a timestamped output directory and writes human-readable reports.

#!/bin/bash
# analyze_system.sh – system log analysis script
LOG_FILE="/var/log/messages"
SECURE_LOG="/var/log/secure"
OUTPUT_DIR="/tmp/log_analysis_$(date +%Y%m%d_%H%M%S)"
mkdir -p $OUTPUT_DIR

echo "=== System errors and warnings ===" > $OUTPUT_DIR/errors.txt
grep -E "error|warning|critical|alert|emerg" $LOG_FILE >> $OUTPUT_DIR/errors.txt

echo "=== SSH login analysis ===" > $OUTPUT_DIR/ssh_analysis.txt
echo "Successful logins: $(grep 'Accepted' $SECURE_LOG | wc -l)" >> $OUTPUT_DIR/ssh_analysis.txt
echo "Failed logins: $(grep 'Failed' $SECURE_LOG | wc -l)" >> $OUTPUT_DIR/ssh_analysis.txt

echo "Top 10 failing IPs:" >> $OUTPUT_DIR/ssh_analysis.txt
grep 'Failed' $SECURE_LOG | awk '{print $11}' | sort | uniq -c | sort -rn | head -10 >> $OUTPUT_DIR/ssh_analysis.txt

echo "=== Disk usage ===" > $OUTPUT_DIR/disk_usage.txt
df -h >> $OUTPUT_DIR/disk_usage.txt
echo "Large directories (>1G):" >> $OUTPUT_DIR/disk_usage.txt
du -sh /var/* 2>/dev/null | sort -rh | awk '$1 ~ /G/ {print}' >> $OUTPUT_DIR/disk_usage.txt

echo "=== System resources ===" > $OUTPUT_DIR/resources.txt
free -h >> $OUTPUT_DIR/resources.txt
uptime >> $OUTPUT_DIR/resources.txt

echo "=== Failed services ===" > $OUTPUT_DIR/services.txt
systemctl list-units --type=service --state=failed --no-pager >> $OUTPUT_DIR/services.txt

echo "Analysis complete. Results in $OUTPUT_DIR" && ls -la $OUTPUT_DIR

6.2 Automated Analysis with logwatch

# install
yum install logwatch -y
# run manually and email report
logwatch --output mail --mailto [email protected] --detail high
# output to file
logwatch --output file --filename /tmp/logwatch.txt --detail high
# focus on a single service
logwatch --service sshd --detail high
# schedule daily run via cron
0 8 * * * /usr/sbin/logwatch --output mail --mailto [email protected]

6.3 Centralized Log Management

Remote rsyslog collection:

# client /etc/rsyslog.conf
*.* @@remote-server:514
# server side
module(load="imtcp")
input(type="imtcp" port="514")
template(name="RemoteLogs" type="string" string="/var/log/remote/%HOSTNAME%/%PROGRAMNAME%.log")
*.* ?RemoteLogs

ELK stack components (Filebeat → Logstash → Elasticsearch → Kibana) for scalable search and visualization.

6.4 Real‑Time Monitoring and Alerting

Using inotifywait to watch /var/log/secure for new SSH failures and send an email alert:

# install inotify-tools
yum install inotify-tools -y
# monitor loop
inotifywait -m -e modify /var/log/secure | while read path action file; do
  if tail -1 "$path$file" | grep -q "Failed password"; then
    echo "Detected SSH failure: $(tail -1 "$path$file")" | mail -s "SSH login alert" [email protected]
  fi
done

7. Common Log Pattern Identification

7.1 Detecting OOM Killer Events

# dmesg search
dmesg | grep -i "out of memory"
dmesg | grep -i "killed process"
# messages file
grep -i "oom" /var/log/messages
# example line
[Mon May 13 10:00:00 2024] Out of memory: Kill process 12345 (java) score 900 or sacrifice child
[Mon May 13 10:00:00 2024] Killed process 12345 (java) total-vm: 8000000kB, anon-rss: 7500000kB, file-rss: 0kB

Follow‑up analysis: identify the killed process, check memory usage trends with free -h and vmstat, and adjust service memory limits.
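
The "Killed process" kernel lines above can be reduced to a list of victims with awk. A sketch run against a sample string that mirrors the format shown (on a real system, feed it dmesg output instead):

```shell
# Reduce "Killed process" kernel lines to (PID, name) pairs.
# 'sample' mirrors the log format shown above.
sample='[Mon May 13 10:00:00 2024] Killed process 12345 (java) total-vm: 8000000kB, anon-rss: 7500000kB, file-rss: 0kB'
victims=$(printf '%s\n' "$sample" | awk '/Killed process/ {
    for (i = 1; i <= NF; i++)
        if ($i == "process") { print $(i + 1), $(i + 2); break }
}')
echo "$victims"   # 12345 (java)
```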

7.2 Identifying Disk I/O Problems

grep -i "io timeout" /var/log/messages
grep -i "ext4" /var/log/messages | grep -i error
dmesg | grep -i "sd[a-z]" | grep -i error
dmesg | grep -i "ata" | grep -i error

7.3 Recognizing Network Issues

# retransmits and failures
netstat -s | grep -i retransmit
netstat -s | grep -i failed
# NIC status
dmesg | grep -i eth0
ip -s link show eth0
# packet loss
netstat -i
ip -s link show

7.4 Spotting Service Crashes

# core dumps
ls -la /var/crash/
find /var/crash -name "core.*" -ls
# segfaults
grep -i "segfault" /var/log/messages
dmesg | grep -i segfault
# ABRT reports
ls -la /var/spool/abrt/

8. Real‑World Cases

8.1 Case – Frequent Reboots Caused by OOM Killer

Background: A server rebooted 2‑3 times daily, disrupting services.

Investigation Steps:

Check reboot timestamps:

last reboot
who -b
last | head -20

Inspect /var/log/messages around each reboot – numerous OOM Killer entries were found.

Confirm with dmesg:

dmesg | grep -i oom
dmesg | grep -i kill

Identify the killed process – MySQL (mysqld).

Analyze memory usage:

free -h
ps aux --sort=-%mem | head -10

Root cause: MySQL innodb_buffer_pool_size set to 16 GB on a 32 GB machine, leaving insufficient memory for other services.

Resolution:

Reduce buffer pool size to 8 GB (runtime and permanent config).

# runtime change (from the MySQL client)
SET GLOBAL innodb_buffer_pool_size = 8589934592;  # 8 GB
# permanent change in my.cnf ([mysqld] section)
innodb_buffer_pool_size = 8G

Restart MySQL and monitor memory.

systemctl restart mariadb
watch -n 5 free -h

Takeaway: OOM Killer logs pinpoint the offending process; adjusting memory limits stabilizes the system.

8.2 Case – Detecting an Intrusion via Log Analysis

Background: A security scan indicated a possible compromise.

Investigation Steps:

Search SSH failure attempts:

grep "Failed password" /var/log/secure | tail -100
grep "Failed password" /var/log/secure | awk '{print $11}' | sort | uniq -c | sort -rn | head -20

Found an IP with thousands of failures – typical brute‑force activity.

Search successful logins for unknown users:

grep "Accepted password" /var/log/secure | awk '{print $9, $11, $12}' | sort | uniq -c | sort -rn

Discovered a login from a user not present in /etc/passwd.

Checked recent command history and suspicious files in /tmp and /var/tmp.

# list setuid binaries
find / -type f -perm -4000 -ls 2>/dev/null
# list /tmp contents
ls -la /tmp/
# find new files in /tmp
find /tmp -type f -newer /tmp/.security -ls 2>/dev/null


Root Cause: An attacker succeeded in SSH brute‑force, obtained a low‑privilege account, and attempted privilege escalation.

Remediation:

Immediately block the attacker IP:

iptables -I INPUT -s attacker_ip -j DROP

Audit for malicious processes and files:

ps aux | grep suspicious
lsof | grep suspicious

Backup data and reinstall the OS to ensure a clean state.

Harden SSH – disable password auth, enforce key‑based login, disable root login.

# /etc/ssh/sshd_config
PasswordAuthentication no
PermitRootLogin no
systemctl restart sshd

Deploy fail2ban to automatically ban repeated failures.

Lesson: Regular log review, combined with tools like fail2ban, can quickly surface brute‑force attempts and limit exposure.

9. Conclusion

Log analysis is a foundational skill for operations engineers. Mastering log locations, command‑line tools, systematic troubleshooting steps, and automation (scripts, logwatch, centralized logging) enables rapid diagnosis of system, service, and security issues.

Know where each log lives and what it records.

Use grep, awk, sed, and journalctl to filter, extract, and aggregate data.

Approach problems from time, keyword, frequency, and trend perspectives.

Close the loop: detect → isolate → fix → verify → document.

Consistent practice turns log analysis from a reactive chore into a proactive, data‑driven operation.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact [email protected] and we will review it promptly.

Tags: Linux, Security, Troubleshooting, log analysis, grep, awk, sed, journalctl, system logs
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
