50 Essential Ops Troubleshooting & Fix Techniques Every Sysadmin Should Know
This guide compiles fifty practical troubleshooting and remediation techniques across the system, network, application, database, and security layers. Each entry pairs a diagnostic technique with a concrete fix, so operations engineers can quickly diagnose common failures such as high load, service crashes, permission errors, and performance bottlenecks, and keep services stable and secure.
System Layer
Check system logs
Technique: View logs with journalctl or files under /var/log to find clues.
Fix: Adjust service configuration based on log findings and restart the service.
High‑load investigation
Technique: Use top or htop to analyze CPU, memory, and I/O usage.
Fix: Optimize heavy processes, adjust priorities, or add resources.
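The load figures that top and htop display are easier to judge once divided by the number of CPU cores. A minimal sketch of that arithmetic follows; the function names and the threshold of 1.0 per core are assumptions chosen for illustration, not a fixed rule.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return a load average normalised by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


def is_overloaded(load_avg: float, cores: int, threshold: float = 1.0) -> bool:
    # threshold is an assumed rule of thumb: more runnable tasks than cores
    return load_per_core(load_avg, cores) > threshold


def current_load_per_core() -> float:
    """Read the 1-minute load average of this machine (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)
```

A load of 8.0 on a 4-core host gives 2.0 per core, which suggests tasks are queuing for CPU or blocked on I/O.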
Memory leak detection
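Technique: Use free and vmstat to watch system-wide memory consumption over time, and valgrind to trace leaks inside a specific program.
Fix: Fix the leak in the code, then restart the affected process; a restart alone only reclaims the memory temporarily.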
Disk space shortage
Technique: Check usage with df -h and locate large files using du -sh.
Fix: Delete unnecessary files, clean logs, or expand the disk.
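The same check that df -h and du -sh perform by hand can be scripted. The sketch below is one possible approach using only the Python standard library; the function names are hypothetical.

```python
import heapq
import os
import shutil


def largest_files(root: str, limit: int = 10) -> list[tuple[int, str]]:
    """Return up to `limit` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symbolic links are not followed
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return heapq.nlargest(limit, sizes)


def percent_used(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use.

    This may differ slightly from the Use% column of df, which accounts
    for blocks reserved for root.
    """
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

Calling largest_files("/var/log", limit=5) would list the five biggest files under /var/log, which is usually the first place to look.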
Service fails to start
Technique: Inspect the unit with systemctl status and review its logs with journalctl -u.
Fix: Correct missing dependencies or configuration errors, then restart.
Kernel parameter tuning
Technique: Query and modify kernel settings via sysctl.
Fix: Optimize TCP buffers, max connections, etc., to improve performance.
Process crash analysis
Technique: Examine kernel messages with dmesg to identify crash causes.
Fix: Resolve resource exhaustion or code bugs, then restart the process.
CPU bottleneck investigation
Technique: Use mpstat or sar to check CPU usage.
Fix: Optimize application code, adjust load balancing, or add CPU cores.
Filesystem issues
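Technique: Run fsck on the unmounted filesystem to detect errors; checking a filesystem that is mounted read-write can corrupt it.
Fix: Repair the filesystem from a rescue environment, or let fsck run at boot before the filesystem is mounted.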
Excessive swap usage
Technique: Monitor swap with vmstat.
Fix: Add physical memory or adjust swap policies.
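On Linux, swap totals can also be read from /proc/meminfo. A minimal sketch of parsing that file and computing the share of swap in use is shown here; the function names are hypothetical and the parser assumes the usual "Name: value kB" layout.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: numeric value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # values are in kB when a unit is given
    return fields


def swap_used_percent(fields: dict[str, int]) -> float:
    """Return the percentage of swap in use, or 0.0 if no swap is configured."""
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - fields.get("SwapFree", 0)) / total
```

On a live system the input would come from open("/proc/meminfo").read().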
Network Layer
Network connectivity check
Technique: Use ping and traceroute to verify reachability and routing.
Fix: Correct network configuration and firewall rules.
Port occupation problems
Technique: List occupied ports with netstat or ss.
Fix: Terminate the offending process or change the application’s port.
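A script can detect an occupied port before starting a service by trying to bind to it. This is only a sketch under the assumption that a local TCP port on 127.0.0.1 is being checked; it does not identify the owning process the way ss or netstat can.

```python
import errno
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding a TCP socket to host:port fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # other errors, e.g. permission denied on ports below 1024
        return False
```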
Firewall issues
Technique: Inspect rules with iptables -L -n, or with firewall-cmd --list-all on hosts managed by firewalld.
Fix: Modify rules to open required ports.
DNS resolution problems
Technique: Query with nslookup or dig.
Fix: Verify local DNS settings or switch to a reliable DNS server.
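It can also help to check what an application on the host actually sees. The sketch below asks the system resolver, which consults /etc/hosts and the configured name servers, whereas dig queries DNS servers directly; the function name is hypothetical.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the sorted unique addresses the system resolver gives for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []  # resolution failed
    # each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP
    return sorted({info[4][0] for info in infos})
```

An empty list here while dig succeeds would point at local resolver configuration rather than at the DNS server.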
Network congestion
Technique: Analyze traffic using iftop or nload.
Fix: Throttle heavy flows, redesign topology, or upgrade bandwidth.
TCP connection timeout
Technique: Check connection states with netstat or ss.
Fix: Adjust TCP timeout parameters and tune connection‑pool settings.
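A pile-up of sockets in one state (for example TIME_WAIT or CLOSE_WAIT) is often the clue. As a rough sketch, the states can be tallied from the text of /proc/net/tcp on Linux, which covers IPv4 only; the function name is hypothetical.

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp: str) -> Counter:
    """Count sockets per TCP state from /proc/net/tcp-style text."""
    counts = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        cols = line.split()
        if len(cols) < 4:
            continue
        counts[TCP_STATES.get(cols[3].upper(), "UNKNOWN")] += 1
    return counts
```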
High bandwidth consumption
Technique: Monitor usage via iftop.
Fix: Limit bandwidth‑heavy processes or rebalance allocation.
ARP conflicts
Technique: View ARP table with arp -a.
Fix: Correct IP address assignments to avoid clashes.
MTU mismatch
Technique: Send non-fragmentable pings of a chosen size with ping -M do -s <size> to find the largest packet the path will carry.
Fix: Align MTU settings with network device parameters.
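The size passed to ping -s is the ICMP payload, not the full packet. For IPv4 without IP options, the packet also carries a 20-byte IP header and an 8-byte ICMP header, so the payload to test is the MTU minus 28. A small sketch of that arithmetic, with hypothetical function names:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def ping_payload_for_mtu(mtu: int) -> int:
    """Largest `ping -s` payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError("MTU is smaller than the IPv4 and ICMP headers")
    return payload


def ping_command(host: str, mtu: int, count: int = 3) -> list[str]:
    """Build the Linux ping command line for testing a given MTU."""
    return ["ping", "-c", str(count), "-M", "do",
            "-s", str(ping_payload_for_mtu(mtu)), host]
```

For a standard 1500-byte Ethernet MTU this yields a payload of 1472 bytes; if that ping fails while a smaller one succeeds, something on the path has a lower MTU.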
SSL certificate issues
Technique: Use openssl s_client and openssl x509 to inspect the certificate's validity dates and chain.
Fix: Renew or regenerate the certificate.
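Expiry is the most common certificate problem, and checking it can be automated. The following is a sketch using Python's ssl module rather than the openssl command line; the function names are hypothetical, and fetch_not_after assumes the host is reachable and presents a certificate that passes verification.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect with TLS and return the certificate's notAfter string.

    An expired or untrusted certificate makes the handshake raise
    ssl.SSLCertVerificationError, which is itself a useful diagnostic.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days from `now` (default: current time) until the notAfter timestamp."""
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400.0
```

A monitoring job might alert when days_until_expiry(fetch_not_after(host)) drops below a chosen number of days.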
Application Layer
Application service crash
Technique: Review log files for pre‑crash entries.
Fix: Optimize configuration or correct code bugs to stabilize the service.
High concurrency bottleneck
Technique: Examine concurrent connections with netstat or sar.
Fix: Add load‑balancer nodes, refine application code and database queries.
Application deadlock
Technique: Debug with strace or gdb to locate deadlocks.
Fix: Rewrite logic to prevent conflicting concurrent operations.
Slow startup
Technique: Trace system calls using strace.
Fix: Streamline the startup sequence and reduce load time.
Oversized application logs
Technique: Periodically check log size and rotate with logrotate.
Fix: Lower log level and purge old logs regularly.
Application port conflict
Technique: Detect occupied ports via lsof or netstat.
Fix: Release the port or change the application’s port configuration.
Connection‑pool exhaustion
Technique: Look for pool‑exhaustion errors in application logs.
Fix: Increase pool size or optimize database queries.
Misconfigured application settings
Technique: Verify parameters in configuration files.
Fix: Correct the config and reload the service.
Application timeout problems
Technique: Test response time with curl or ab.
Fix: Raise timeout thresholds and speed up database queries.
Dependent service unavailable
Technique: Probe dependencies using curl or telnet.
Fix: Restart or repair the dependent service.
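A TCP connect test is the scripted equivalent of probing a dependency with telnet. This minimal sketch only confirms that something accepts connections on the port, not that the service behind it is healthy; the function name is hypothetical.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers connection refused, timeouts, and name-resolution failures
        return False
```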
Database Layer
Database connection failure
Technique: Verify port accessibility, user permissions, and network reachability.
Fix: Correct permission issues or network configuration.
Slow query problems
Technique: Run EXPLAIN to view the execution plan.
Fix: Optimize SQL, add indexes, or partition tables.
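The effect of adding an index shows up directly in the execution plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in, because it ships with Python; the orders table and index name are made up for illustration, and MySQL or PostgreSQL EXPLAIN output looks different.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list[str]:
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the plan text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

sql = "SELECT * FROM orders WHERE customer_id = ?"
plan_before = query_plan(conn, sql, (42,))  # full table scan expected

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))   # index search expected
```

Before the index exists the plan reports a scan of the whole table; afterwards it reports a search using idx_orders_customer.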
Database deadlock
Technique: Inspect lock status (e.g., SHOW ENGINE INNODB STATUS in MySQL).
Fix: Refactor transactions to avoid long‑running locks.
Performance bottlenecks
Technique: Use mysqltuner or built‑in monitoring tools.
Fix: Increase cache, optimize queries, or upgrade hardware.
Master‑slave replication lag
Technique: Review replication status and load on the master.
Fix: Reduce master load, add slaves, or adjust replication settings.
Table locking
Technique: Check locks with SHOW PROCESSLIST or equivalent.
Fix: Optimize queries and minimize large-batch operations.
Backup failures
Technique: Examine backup logs to pinpoint the cause.
Fix: Expand storage or modify backup strategy.
Database I/O issues
Technique: Monitor I/O with iostat.
Fix: Deploy SSDs or add RAID arrays to improve throughput.
Insufficient tablespace
Technique: Query space usage via SHOW TABLE STATUS.
Fix: Expand tablespace and purge unused data.
Too many connections
Technique: In MySQL, view current connections with SHOW STATUS LIKE 'Threads_connected' or SHOW PROCESSLIST.
Fix: Raise max_connections or fine-tune the connection pool.
Security & Permission Management
Permission errors blocking access
Technique: Inspect ownership and mode with ls -l to see which user or group is being denied.
Fix: Set appropriate ownership and permissions with chown and chmod, granting each user only the access it needs.
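Before changing anything, it helps to record what the current ownership and mode are. A minimal sketch that reports them for a path follows; the function names are hypothetical.

```python
import os
import stat


def describe_permissions(path: str) -> dict:
    """Return the mode string, octal mode, and numeric owner of a path."""
    info = os.stat(path)
    return {
        "mode": stat.filemode(info.st_mode),        # e.g. "-rw-r-----"
        "octal": oct(stat.S_IMODE(info.st_mode)),   # e.g. "0o640"
        "uid": info.st_uid,
        "gid": info.st_gid,
    }


def readable_by_current_user(path: str) -> bool:
    """Check read access for the user running this script."""
    return os.access(path, os.R_OK)
```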
SSH login failures
Technique: Review /var/log/auth.log or journalctl for error details.
Fix: Correct SSH configuration and firewall rules.
Brute‑force protection
Technique: Deploy tools like fail2ban to monitor suspicious attempts.
Fix: Configure auto‑ban policies to shield the server.
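The idea behind such tools is to count failed logins per source address and act once a limit is exceeded. The sketch below illustrates only that counting step against sshd "Failed password" log lines; it is not how fail2ban is implemented, and the function names and the max_retry default are assumptions.

```python
import re
from collections import Counter

# Matches sshd lines such as:
#   Failed password for invalid user admin from 203.0.113.5 port 51514 ssh2
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+"
)


def failed_logins_by_ip(lines) -> Counter:
    """Count failed SSH password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def ips_over_threshold(counts: Counter, max_retry: int = 5) -> list[str]:
    """Return addresses with at least max_retry failures, sorted."""
    return sorted(ip for ip, n in counts.items() if n >= max_retry)
```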
Overly strict firewall rules
Technique: Examine rules with iptables -L -n, or with firewall-cmd --list-all on hosts managed by firewalld.
Fix: Open required ports and balance policies.
Regular password changes
Technique: Enforce password aging, for example with chage or the PASS_MAX_DAYS setting in /etc/login.defs.
Fix: Require users to update passwords on schedule.
Log auditing
Technique: Use auditd to capture user activity.
Fix: Regularly review logs for anomalies.
File integrity checks
Technique: Run tripwire or aide to verify integrity.
Fix: Respond to alerts by repairing or reporting issues.
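Integrity checkers work by recording a baseline of file hashes and comparing later snapshots against it. A much-simplified sketch of that idea is shown below; it hashes contents only, whereas tools like aide also track permissions, ownership, and timestamps, and the function names are hypothetical.

```python
import hashlib
import os


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            result[os.path.relpath(path, root)] = sha256_of(path)
    return result


def diff_snapshots(baseline: dict[str, str], current: dict[str, str]) -> dict:
    """Report files added, removed, or modified since the baseline."""
    common = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(p for p in common if baseline[p] != current[p]),
    }
```

In practice the baseline would be stored somewhere an attacker cannot modify, such as read-only media or a separate host.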
Application vulnerability scanning
Technique: Scan with OpenVAS or Nessus.
Fix: Patch identified vulnerabilities promptly.
ACL management
Technique: View ACLs with getfacl and modify them with setfacl.
Fix: Apply sensible access controls to prevent privilege abuse.
Logrotate failures
Technique: Check the logrotate configuration for errors, for example with a debug run via logrotate -d.
Fix: Adjust rotation policies to ensure logs are archived.
Liangxu Linux
Liangxu, a self-taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge: fundamentals, applications, tools, plus Git, databases, Raspberry Pi, and more.