Master Incident Response: Diagnose and Recover Service Outages in 15 Minutes
When a service crashes and users flood you with complaints, following a structured 15‑minute workflow—first narrowing the impact, then probing six layers (network, system, application, data, external services, security), and finally documenting the incident—lets you pinpoint and fix most outages quickly and reliably.
1. 3‑Minute Emergency: Define the Scope
Before running any commands, spend three minutes answering three questions to shrink the investigation radius:
Who is affected? Individual user, specific region, or all users? (Regional issues often start with CDN or ISP checks.)
Which part is down? Whole site, a module (payment, login), front‑end, or back‑end? (Focus on the relevant service.)
How is it failing? Continuous outage or intermittent spikes? Does it happen during traffic peaks or after a deployment? (Peak issues point to resource exhaustion; post‑deployment issues point to configuration.)
Example: Only Shenzhen users cannot place orders after a CDN change an hour ago – directly target the CDN node for that region, saving half the time.
2. 10‑Minute Layered Investigation (Six Dimensions)
Work through the layers in order, Network → System → Application → Data → External Dependency → Security Device, spending 1‑2 minutes on each layer.
(1) Network Layer – 1 Minute to Verify Connectivity
Server reachability: ping -c 4 SERVER_IP_OR_HOSTNAME (timeout = unreachable)
Route tracing: traceroute SERVER_IP (timeout at a hop = faulty node)
Port test: telnet SERVER_IP PORT (e.g., telnet SERVER_IP 8080, "Connection refused" = port closed)
Interface check: curl -I INTERFACE_URL (200 = OK, 4xx/5xx = error)
Typical issues: a firewall blocking the port or ISP instability – temporarily open the port or switch to a backup line.
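A minimal sketch that runs these connectivity checks together and prints a pass/fail summary; SERVER_IP, PORT, and INTERFACE_URL are placeholders, and bash's /dev/tcp is used in place of telnet since telnet is often not installed:
#!/usr/bin/env bash
# Quick connectivity summary: reachability, port, and HTTP status.
SERVER_IP="10.0.0.10"                          # placeholder
PORT=8080                                      # placeholder
INTERFACE_URL="http://10.0.0.10:8080/api/ping" # placeholder

ping -c 4 -W 2 "$SERVER_IP" >/dev/null && echo "ping: OK" || echo "ping: UNREACHABLE"

# Port probe via bash's /dev/tcp (works without telnet installed)
timeout 3 bash -c "echo > /dev/tcp/$SERVER_IP/$PORT" 2>/dev/null \
  && echo "port $PORT: OPEN" || echo "port $PORT: CLOSED or FILTERED"

# 200 = OK, 4xx/5xx = application or gateway error, 000 = no response
code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$INTERFACE_URL")
echo "HTTP status: $code"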
(2) System Layer – 2 Minutes to Check Resource Saturation
CPU/Load: top (sort by %CPU, >90% indicates overload) or uptime (load > number of CPU cores = pressure)
Memory: free -h (available ≈ 0 = out of memory)
Disk: df -h (usage >85% warning, >95% full) and du -sh /data/* (find large files)
Logs: journalctl -n 100 --no-pager or tail -n 100 /var/log/messages (search for OOM or disk‑full messages)
Emergency actions: kill non‑essential processes (kill -9 PID) and truncate oversized logs (truncate -s 0 LOG_FILE or > LOG_FILE).
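A minimal sketch of a one-look resource snapshot using the thresholds above (load vs. core count, near-zero available memory, disk above 85%); it assumes a Linux host with procps and systemd, and the 200 MiB memory floor is an illustrative cut-off:
#!/usr/bin/env bash
# One-look resource snapshot using the thresholds above.
cores=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
echo "load(1m)=$load1 cores=$cores"
awk -v l="$load1" -v c="$cores" 'BEGIN { if (l+0 > c+0) print "WARN: load exceeds core count" }'

# "available" memory in MiB; close to zero means the box is out of memory
free -m | awk '/^Mem:/ { print "mem available(MiB)=" $7; if ($7 < 200) print "WARN: memory nearly exhausted" }'

# Any filesystem above 85% usage
df -hP | awk 'NR > 1 && int($5) > 85 { print "WARN: " $6 " at " $5 }'

# Recent kernel messages hinting at OOM kills or full disks
journalctl -k -n 200 --no-pager 2>/dev/null | grep -Ei 'out of memory|oom|no space left' | tail -5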
(3) Application Layer – 2 Minutes to Verify Service Health
Service status: systemctl status SERVICE_NAME (e.g., systemctl status nginx, "active" = running)
Process check: ps -ef | grep SERVICE_NAME (no output = process not started)
Port listening: ss -tulnp | grep PORT (no output = port not listening)
Log & health check: tail -f /var/log/APP.log (search for ERROR) and curl -v http://localhost:PORT/health ("UP" = healthy)
Common problem: crashed process or port conflict – restart the service or free the port.
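A minimal sketch of a combined service health check; SERVICE_NAME, PORT, and the /health path are placeholders for your own service:
#!/usr/bin/env bash
# Service health check: unit state, listening port, health endpoint.
SERVICE_NAME="nginx"  # placeholder
PORT=8080             # placeholder

systemctl is-active --quiet "$SERVICE_NAME" && echo "service: active" || echo "service: DOWN"
ss -tulnp | grep -q ":$PORT " && echo "port $PORT: listening" || echo "port $PORT: NOT listening"
curl -fsS --max-time 3 "http://localhost:$PORT/health" && echo " <- health OK" || echo "health check FAILED"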
(4) Data Layer – 2 Minutes to Assess Database/Cache Health
MySQL login test: mysql -uUSER -p -hIP (failure = service down or bad credentials)
MySQL slow/blocked queries: mysql -e "show full processlist;" (a Locked/waiting State = blocked, Time > 60 s = slow query)
Redis connectivity: redis-cli -h IP ping ("PONG" = OK)
Redis slowlog: redis-cli -h IP slowlog get 5 (shows slow commands)
Emergency actions: kill the blocking query inside MySQL (KILL <id from the processlist>) or temporarily raise the connection limit (set global max_connections=2000).
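A minimal sketch that lists long-running MySQL queries (candidates for KILL <id>) and checks Redis; the hosts are placeholders, and the password is assumed to come from the MYSQL_PWD environment variable rather than the command line:
#!/usr/bin/env bash
# Long-running MySQL queries and Redis liveness.
DB_HOST="10.0.0.20"     # placeholder
REDIS_HOST="10.0.0.21"  # placeholder
# The mysql client reads the password from the MYSQL_PWD environment variable.

mysql -u"$MYSQL_USER" -h"$DB_HOST" -e "
  SELECT id, user, time, state, LEFT(info, 80) AS query_text
  FROM information_schema.processlist
  WHERE command <> 'Sleep' AND time > 60
  ORDER BY time DESC;"

redis-cli -h "$REDIS_HOST" ping          # expect PONG
redis-cli -h "$REDIS_HOST" slowlog get 5 # recent slow commands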
(5) External Dependency Layer – 1 Minute to Verify Third‑Party Services
Third‑party API: curl -I API_URL (timeout or 5xx = external failure)
DNS resolution: nslookup DOMAIN (resolution failure or an unexpected IP = misconfiguration or DNS hijack)
CDN cache: curl -sI DOMAIN | grep -i cache (HIT = cached, no hit may require a cache refresh)
Emergency fix: switch to a backup API endpoint or use an alternative DNS server (e.g., 223.5.5.5).
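A minimal sketch of the external checks, with a hypothetical API_URL and DOMAIN; the second nslookup queries the public resolver 223.5.5.5 so the two answers can be compared:
#!/usr/bin/env bash
# Third-party API, DNS, and CDN cache check.
API_URL="https://api.example.com/v1/status" # placeholder
DOMAIN="www.example.com"                    # placeholder

# 5xx or a timeout points at the provider side
curl -s -o /dev/null -w "API %{http_code} in %{time_total}s\n" --max-time 5 "$API_URL"

# Compare the default resolver with a public one to spot hijack/misconfiguration
nslookup "$DOMAIN" | tail -n +3
nslookup "$DOMAIN" 223.5.5.5 | tail -n +3

# Response headers only; look for HIT/MISS in the cache headers
curl -sI "https://$DOMAIN" | grep -iE '^(x-cache|age:|cf-cache-status)'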
(6) Security Device Layer – 1 Minute to Check Blocking Rules
Firewall rules: iptables -L -n (look for DROP rules) and firewall-cmd --list-ports (confirm allowed ports)
WAF/IPS: consult the security team for block logs and add temporary whitelist entries for legitimate IPs/requests.
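A minimal sketch for confirming whether the local firewall is blocking a given port; PORT is a placeholder, and the firewall-cmd change in the final comment is runtime-only because --permanent is omitted:
#!/usr/bin/env bash
# Is the local firewall the blocker for this port?
PORT=8080  # placeholder

iptables -L -n --line-numbers | grep -E 'DROP|REJECT'
firewall-cmd --list-ports 2>/dev/null | tr ' ' '\n' | grep -qx "${PORT}/tcp" \
  && echo "port $PORT/tcp: allowed" || echo "port $PORT/tcp: NOT in allowed list"

# Runtime-only allow (reverted on firewalld reload), if the port should be open:
# firewall-cmd --add-port=${PORT}/tcp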
3. 2‑Minute Information Collection for Post‑Mortem
User error screenshots (error code, timestamp)
Key excerpts from system/application logs
CPU, memory, disk, and network monitoring data
Status records of services, databases, and third‑party interfaces
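A minimal sketch that gathers this evidence into one timestamped archive; the log path and service name are placeholders to adjust:
#!/usr/bin/env bash
# Bundle post-mortem evidence into one timestamped archive.
OUT="incident-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"
uptime  > "$OUT/load.txt"
free -h > "$OUT/memory.txt"
df -h   > "$OUT/disk.txt"
ss -s   > "$OUT/sockets.txt"
journalctl -n 500 --no-pager > "$OUT/system.log" 2>/dev/null
tail -n 500 /var/log/APP.log > "$OUT/app.log" 2>/dev/null                  # placeholder path
systemctl status SERVICE_NAME --no-pager > "$OUT/service-status.txt" 2>&1  # placeholder name
tar czf "$OUT.tar.gz" "$OUT" && echo "evidence packed: $OUT.tar.gz"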
4. Post‑Incident Actions (3 Steps)
Root‑cause review: Identify whether the issue stemmed from resources, configuration, code, or third‑party services and assign responsibility.
Monitoring optimization: Set alerts on core metrics (CPU, connection count, API latency) via SMS, DingTalk, etc.
Script consolidation: Turn frequently used diagnostic commands into reusable scripts (e.g., a one‑click system‑resource check).
Conclusion
The essence of outage troubleshooting is not memorizing commands but “defining the scope first, then breaking through layers.” Following this 15‑minute workflow lets you locate and resolve over 80% of incidents quickly. Save this guide and apply it next time a service goes down.
Xiao Liu Lab
An operations lab passionate about server tinkering 🔬 Sharing automation scripts, high-availability architecture, alert optimization, and incident reviews. Using technology to reduce overtime and experience to avoid major pitfalls. Follow me for easier, more reliable operations!
