
20 Essential Linux & Kubernetes Troubleshooting Commands Every DevOps Engineer Should Know

This guide compiles the 20 most common Linux and Kubernetes troubleshooting commands, illustrating typical outputs and step‑by‑step diagnostic reasoning for high CPU load, disk pressure, network failures, pod crashes, node issues, service outages, database errors, and application performance problems.

Full-Stack DevOps & Kubernetes

Linux Troubleshooting Commands

1.1 System load too high

top
top - 15:45:21 up 10 days, 4:35, 1 user, load average: 5.73, 4.65, 3.84
Tasks: 195 total, 1 running, 194 sleeping, 0 stopped, 0 zombie
%Cpu(s): 28.2 us, 3.3 sy, 0.0 ni, 68.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 8192.0 total, 2050.4 free, 6123.8 used, 1018.0 buff/cache
MiB Swap: 2048.0 total, 1500.0 free, 548.0 used. 2075.2 avail Mem

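Compare the load average with the number of CPU cores (nproc): a load of 5.73 is heavy on a 4-core host but unremarkable on a 16-core one. In top, press P to sort by CPU or M to sort by memory and find the heaviest processes. The sample output also shows 548 MiB of swap in use, which suggests memory pressure. Terminate a runaway process with kill only once you know what it is, and investigate possible memory leaks.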
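
The load-versus-cores comparison can be sketched in a few lines of shell. This is a minimal sketch, not part of the original guide: the function names load_per_core and current_load_per_core are hypothetical, and the second one assumes a Linux host with /proc/loadavg and nproc.

```shell
# load_per_core LOAD CORES: print the load average divided by the core count.
# A value that stays well above 1.00 suggests more runnable (or
# I/O-blocked) tasks than the CPUs can serve.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# current_load_per_core: same calculation for the local host, using the
# 1-minute load from /proc/loadavg and the core count from nproc.
current_load_per_core() {
    read -r one_min _rest < /proc/loadavg
    load_per_core "$one_min" "$(nproc)"
}
```

With the sample load of 5.73, load_per_core 5.73 4 prints 1.43, which would mean a 4-core host is oversubscribed.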

1.2 Disk space shortage

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   45G   3.0G  94% /
tmpfs            16G  1.1G   15G   7% /dev/shm
/dev/sdb1        20G   18G   1.2G  94% /mnt/data
du -sh /var/log/*
1.3G    /var/log/syslog
500M    /var/log/auth.log
200M    /var/log/kern.log

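Use df -h to see which filesystems are nearly full (here / and /mnt/data are both at 94%), then run du -sh on the suspect directories to find what is taking the space. Delete or archive files you no longer need, or expand the storage.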
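
To flag nearly full filesystems automatically, the df output can be filtered by usage percentage. The helper below is an illustrative sketch; fs_over_threshold is a hypothetical name, and it assumes the six-column layout printed by df -P with mount points that contain no spaces.

```shell
# fs_over_threshold PERCENT: read `df -P` output on stdin and print
# "<mount point> <use%>" for every filesystem at or above PERCENT.
fs_over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)            # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

Typical use would be df -P | fs_over_threshold 90. On systems with GNU sort, du -sh /var/log/* | sort -rh | head then lists the largest entries first.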

1.3 Network connectivity issues

ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=55 time=13.4 ms
...
traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  192.168.1.1 (192.168.1.1)  0.435 ms  0.425 ms  0.400 ms
 2  * * *
 3  10.3.1.1 (10.3.1.1)  5.089 ms  5.065 ms  5.063 ms
 4  8.8.8.8 (8.8.8.8)  12.539 ms  12.533 ms  12.510 ms

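If ping to a public IP address fails, check the local interface, default route, and gateway first. Use traceroute to see where packets stop and to spot routing problems or firewalls. A hop that shows * * * (hop 2 above) only means that router did not answer the probes; because later hops respond, it is not the fault.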
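
For scripted checks, the packet-loss figure can be pulled out of the ping summary. This is a rough sketch: ping_loss is a hypothetical helper, and it assumes the summary line contains the text "N% packet loss", as common ping implementations print.

```shell
# ping_loss: read ping output on stdin and print the packet-loss
# percentage from the summary line (for example 25 for "25% packet loss").
ping_loss() {
    sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}
```

For example, ping -c 4 8.8.8.8 | ping_loss prints 0 on a healthy link; partial loss points to an unstable path rather than a complete outage.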

Kubernetes Troubleshooting Commands

2.1 Pod fails to start (CrashLoopBackOff)

kubectl describe pod my-pod
Name:           my-pod
Namespace:      default
Node:           node-1/192.168.1.100
Containers:
  my-container:
    Container ID:   docker://d4f2e3a6b8db
    Image:          my-app:v1.0
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
kubectl logs my-pod
Error: java.lang.ExceptionInInitializerError: Unable to initialize the application.

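Examine the container logs with kubectl logs. For a container that keeps restarting, kubectl logs my-pod --previous shows the output of the last crashed instance. Then fix the configuration, environment variables, or missing dependencies that cause the crash.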
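
When several pods are failing, it helps to list the crash-looping ones first. The filter below is a sketch under stated assumptions: crashlooping_pods is a hypothetical name, and the input is the single-namespace output of kubectl get pods --no-headers, where STATUS is the third column.

```shell
# crashlooping_pods: read `kubectl get pods --no-headers` output on stdin
# and print the names of pods whose STATUS column is CrashLoopBackOff.
crashlooping_pods() {
    awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Example wiring (needs cluster access, so it is left commented out):
# kubectl get pods --no-headers | crashlooping_pods | while read -r pod; do
#     kubectl logs "$pod" --previous --tail=20
# done
```

The commented loop prints the last 20 log lines of the previous container instance for each affected pod.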

2.2 Node reports NotReady

kubectl get nodes
NAME      STATUS    ROLES   AGE   VERSION
node-1    NotReady  <none>  10d   v1.23.4
node-2    Ready     <none>  12d   v1.23.4
kubectl describe node node-1
Name:               node-1
Conditions:
  Type    Status  LastHeartbeatTime            Reason               Message
  Ready   False   Mon, 27 Nov 2024 11:15:00 -0500  KubeletNotReady      Kubelet stopped posting node status.
  DiskPressure True   Mon, 27 Nov 2024 11:10:00 -0500  KubeletHasDiskPressure  kubelet has disk pressure
journalctl -u kubelet
Nov 27 11:10:00 node-1 kubelet[1375]: node "node-1" has disk pressure, evicting pods

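NotReady is often caused by resource pressure (disk or memory) or by a kubelet that has stopped reporting. Check the Conditions section of kubectl describe node, then read the kubelet log on the affected node with journalctl -u kubelet to pinpoint the issue.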
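
On larger clusters the unhealthy nodes can be extracted from the node list before describing each one. A minimal sketch, assuming the column layout of kubectl get nodes --no-headers; not_ready_nodes is a hypothetical helper name.

```shell
# not_ready_nodes: read `kubectl get nodes --no-headers` output on stdin
# and print nodes whose STATUS does not start with "Ready".
# A cordoned node ("Ready,SchedulingDisabled") still counts as ready here.
not_ready_nodes() {
    awk '$2 !~ /^Ready/ { print $1 }'
}

# Example wiring (needs cluster access, so it is left commented out):
# kubectl get nodes --no-headers | not_ready_nodes | while read -r node; do
#     kubectl describe node "$node"
# done
```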

2.3 Service unreachable

kubectl get svc
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
my-service   ClusterIP   10.96.0.1      <none>        80/TCP    12d
kubectl describe svc my-service
Name:              my-service
Namespace:         default
Labels:            app=my-app
Selector:          app=my-app
Type:              ClusterIP
IP:                10.96.0.1
Port:              80/TCP
Endpoints:         10.1.1.2:80,10.1.1.3:80

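Verify the service type and port, and check that the listed endpoints correspond to healthy pods. An empty Endpoints field (<none>) means the selector matches no ready pods, so compare the Selector with the pod labels.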
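
The endpoint check can be scripted against the describe output. The two helpers below are an illustrative sketch with hypothetical names (service_endpoints, has_endpoints); they assume the Endpoints line format shown above.

```shell
# service_endpoints: read `kubectl describe svc NAME` output on stdin and
# print the value of the Endpoints field.
service_endpoints() {
    awk '$1 == "Endpoints:" { print $2; exit }'
}

# has_endpoints: succeed only if the Endpoints field lists at least one address.
has_endpoints() {
    eps=$(service_endpoints)
    [ -n "$eps" ] && [ "$eps" != "<none>" ]
}
```

For example, kubectl describe svc my-service | has_endpoints || echo "no ready pods behind my-service". If the check fails, kubectl get pods -l app=my-app shows whether any pod carries the label the selector expects.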

Database Troubleshooting Commands

3.1 Connection failure

mysql -u root -p
Enter password:
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)

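ERROR 1045 means the server was reached and rejected the login, so check the user name, password, and the privileges granted to that user for the connecting host. If the client reports ERROR 2002 or 2003 instead, the server is not running or not reachable.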
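
A small lookup makes the distinction between authentication and connectivity errors explicit. This is a sketch only: mysql_error_hint is a hypothetical helper, and the hint texts are suggestions rather than official MySQL wording.

```shell
# mysql_error_hint CODE: map a MySQL client error number to the first
# thing worth checking.
mysql_error_hint() {
    case "$1" in
        1045) echo "authentication failed: check user, password, and host grants" ;;
        2002) echo "local socket connect failed: check that the server is running and the socket path" ;;
        2003) echo "TCP connect failed: check host, port, and firewall" ;;
        *)    echo "unrecognized code: look it up in the MySQL error reference" ;;
    esac
}
```

For the failure shown above, mysql_error_hint 1045 points at credentials and grants rather than at the network.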

3.2 Table corruption

mysqlcheck -u root -p --check --auto-repair mydb
+------------+-------+----------+-------------------------------+--------+
| Table      | Op    | Msg_type | Msg_text                      | Errors |
+------------+-------+----------+-------------------------------+--------+
| mydb.table | check | Warning  | Found row with wrong checksum |        |
...

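Run mysqlcheck to detect damaged tables and, with --auto-repair, to repair them. Repair works only for storage engines that support REPAIR TABLE (such as MyISAM), not for InnoDB. Take a backup before repairing, and maintain regular backups.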

3.3 Performance bottlenecks

SHOW PROCESSLIST;
Id   User  Host        db   Command Time State      Info
1234 app   localhost   mydb Query   10   Sending data SELECT * FROM large_table WHERE ...
1235 app   localhost   mydb Query   20   Sorting      SELECT * FROM another_table ORDER BY ...

Identify long-running queries (the Time column is in seconds), analyze them with EXPLAIN, and monitor CPU, memory, and I/O usage.
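
To pick out long-running statements without reading the whole list, the process list can be filtered by the Time column. A minimal sketch, assuming tab-separated output with a header row as the mysql client prints in batch mode; slow_queries is a hypothetical name.

```shell
# slow_queries SECONDS: read tab-separated SHOW PROCESSLIST output on stdin
# and print Id, Time, and Info for statements in the Query state that have
# been running for at least SECONDS.
slow_queries() {
    awk -F '\t' -v min="$1" 'NR > 1 && $5 == "Query" && $6 + 0 >= min + 0 {
        print $1 "\t" $6 "\t" $8
    }'
}

# Example wiring (needs database access, so it is left commented out):
# mysql --batch -u root -p -e 'SHOW FULL PROCESSLIST' | slow_queries 10
```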

3.4 High load

SHOW GLOBAL STATUS LIKE 'Threads_running';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| Threads_running | 250   |
+-----------------+-------+
SHOW ENGINE INNODB STATUS\G
------------------------
LATEST DETECTED DEADLOCK
------------------------
*** (1) TRANSACTION:
TRANSACTION 12345, ACTIVE 10 sec fetching rows
...

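Find out why so many threads are running at once (slow queries, lock waits, or a connection surge) and reduce the concurrency at its source. Use the LATEST DETECTED DEADLOCK section to see which statements collided. Deadlocks are usually reduced by keeping transactions short, accessing rows in a consistent order, and adding indexes so that fewer rows are locked; adjusting the isolation level can also help. Optimize the schema and indexes accordingly.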

Application Troubleshooting Commands

4.1 Application crash (Java OOM)

journalctl -u myapp.service
Nov 27 12:00:00 myserver myapp[1234]: Error: OutOfMemoryError: Java heap space
Nov 27 12:01:00 myserver myapp[1234]: Service stopped unexpectedly.

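Increase the JVM heap limits if the workload genuinely needs more memory (-Xmx sets the maximum heap size, -Xms the initial size). A larger heap only delays a leak, so enable -XX:+HeapDumpOnOutOfMemoryError and analyze the dump with tools such as VisualVM or MAT to detect memory leaks.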

4.2 Slow startup

strace -p PID
open("/var/lib/myapp/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory)

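Check that the required configuration files exist and have the correct permissions, and remove unnecessary steps from the startup script. Note that strace -p attaches to a process that is already running; to trace startup from the first system call, launch the command under strace instead, with -f to follow child processes and -tt to timestamp each call.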
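
Failed file lookups are easy to miss in a long trace, so they can be filtered out. The helper below is a sketch; missing_files is a hypothetical name, and it assumes strace lines in which the path is the first quoted argument, as in the sample above.

```shell
# missing_files: read strace output on stdin and print each path whose
# system call failed with ENOENT (no such file or directory), de-duplicated.
missing_files() {
    sed -n 's/^[^"]*"\([^"]*\)".* = -1 ENOENT.*/\1/p' | sort -u
}

# Example wiring (COMMAND stands for the application's start command):
# strace -f -e trace=file -o /tmp/myapp.trace COMMAND
# missing_files < /tmp/myapp.trace
```

Many ENOENT results are harmless search-path probes, so look for paths that belong to the application itself.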

4.3 Performance degradation

top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1234 myuser 20 0 1.2g 500m 30m R 95.0 6.3 10:45.43 java

Profile CPU and memory usage with a profiler or an APM tool (e.g., JProfiler, New Relic), and verify that downstream services are not the bottleneck.

4.4 Log overflow

du -sh /var/log/myapp.log
20G    /var/log/myapp.log

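Enable log rotation (e.g., logrotate), lower the log verbosity, and periodically purge old logs to free disk space. If the application still has the log file open, deleting the file does not release the space until the process closes it; truncate the file in place instead.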
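
As a stopgap before log rotation is configured, an oversized log can be emptied in place. This is a minimal sketch with a hypothetical helper name, truncate_if_larger; it discards the log contents, so copy the file first if it is still needed.

```shell
# truncate_if_larger FILE MAX_BYTES: empty FILE in place when it is larger
# than MAX_BYTES. Truncating keeps the same file, so the space is released
# even while a process still holds it open.
truncate_if_larger() {
    file=$1
    max_bytes=$2
    [ -f "$file" ] || return 1
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -gt "$max_bytes" ]; then
        : > "$file"
    fi
}
```

For the 20G file above, truncate_if_larger /var/log/myapp.log 1073741824 would empty it because it exceeds 1 GiB.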

