Operations · 10 min read

Master Incident Response: Diagnose and Recover Service Outages in 15 Minutes

When a service crashes and users flood you with complaints, following a structured 15‑minute workflow—first narrowing the impact, then probing six layers (network, system, application, data, external services, security), and finally documenting the incident—lets you pinpoint and fix most outages quickly and reliably.

Xiao Liu Lab

1. 3‑Minute Emergency: Define the Scope

Before running any commands, spend three minutes answering three questions to shrink the investigation radius:

Who is affected? Individual user, specific region, or all users? (Regional issues often start with CDN or ISP checks.)

Which part is down? Whole site, a module (payment, login), front‑end, or back‑end? (Focus on the relevant service.)

How is it failing? Continuous outage or intermittent spikes? Does it happen during traffic peaks or after a deployment? (Peak issues point to resource exhaustion; post‑deployment issues point to configuration.)

Example: only Shenzhen users cannot place orders, and a CDN change went out an hour ago – target that region's CDN node directly, cutting investigation time in half.

2. 10‑Minute Layered Investigation (Six Dimensions)

Follow the order Network → System → Application → Data → External Dependency → Security Device, spending 1–2 minutes on each layer.

(1) Network Layer – 1 Minute to Verify Connectivity

Server reachability: ping -c 4 SERVER_IP_OR_HOSTNAME (timeout = unreachable)

Route tracing: traceroute SERVER_IP (timeout at a hop = faulty node)

Port test: telnet SERVER_IP PORT (e.g., telnet SERVER_IP 8080, "Connection refused" = port closed)

Endpoint check: curl -I ENDPOINT_URL (200 = OK, 4xx/5xx = error)

Typical issue: firewall blocks a port or ISP instability – temporarily open the port or switch to a backup line.
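The four checks above can be combined into one sketch; HOST, PORT, and URL are placeholders for the affected service, and classify_http is a hypothetical helper, not part of any standard tool:

```shell
#!/usr/bin/env bash
# Sketch of the 1-minute network check; replace HOST/PORT/URL with the
# affected service's address before use.

classify_http() {
  # 2xx/3xx means the endpoint answered normally; anything else is an error.
  case "$1" in
    2??|3??) echo "OK" ;;
    *)       echo "ERROR" ;;
  esac
}

net_check() {
  local host="$1" port="$2" url="$3"
  ping -c 4 -W 2 "$host" >/dev/null 2>&1 \
    || echo "ping: $host unreachable (check route/firewall)"
  # /dev/tcp is a bash feature, so this works even where telnet is absent
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null \
    || echo "port: $host:$port closed or filtered"
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  echo "http: $code ($(classify_http "$code"))"
}
```

Calling `net_check SERVER_IP 8080 http://SERVER_IP:8080/` runs all four probes in a few seconds and prints only the failures plus the final HTTP verdict.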

(2) System Layer – 2 Minutes to Check Resource Saturation

CPU/Load: top (sort by %CPU, >90% indicates overload) or uptime (load > number of CPU cores = pressure)

Memory: free -h (available ≈ 0 = out of memory)

Disk: df -h (usage >85% warning, >95% full) and du -sh /data/* (find large files)

Logs: journalctl -n 100 or tail -n 100 /var/log/messages (search for OOM or disk‑full messages; note that journalctl reads the systemd journal and does not take a file path)

Emergency actions: kill unnecessary processes (kill -9 PID; note this skips graceful shutdown) and truncate old logs in place (> LOG_FILE) rather than deleting files a running process still holds open.
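The resource checks above can be sketched as a script; disk_status is a hypothetical helper encoding the 85%/95% thresholds from the text:

```shell
#!/usr/bin/env bash
# Sketch of the 2-minute system-layer check.

disk_status() {
  # Classify a disk-usage percentage: >95 full, >85 warning, else OK.
  local pct="$1"
  if   [ "$pct" -gt 95 ]; then echo "FULL"
  elif [ "$pct" -gt 85 ]; then echo "WARN"
  else                         echo "OK"
  fi
}

sys_check() {
  uptime     # compare load average against the CPU core count
  free -h    # "available" near zero means memory pressure
  # Flag any filesystem past the warning thresholds
  df -P | awk 'NR>1 {sub("%","",$5); print $6, $5}' | while read -r mnt pct; do
    [ "$(disk_status "$pct")" != "OK" ] && echo "disk: $mnt at ${pct}%"
  done
}
```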

(3) Application Layer – 2 Minutes to Verify Service Health

Service status: systemctl status SERVICE_NAME (e.g., systemctl status nginx, "active" = running)

Process check: ps -ef | grep '[S]ERVICE_NAME' (the bracket trick excludes the grep itself from the results; no output = process not running), or simply pgrep -f SERVICE_NAME

Port listening: ss -tulnp | grep PORT (no output = port not listening)

Log & health check: tail -n 200 /var/log/APP.log | grep ERROR (or tail -f to follow live) and curl -v http://localhost:PORT/health ("UP" = healthy)

Common problem: crashed process or port conflict – restart the service or free the port.
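The application checks can be sketched as follows; SERVICE and PORT are placeholders, and health_verdict is a hypothetical helper assuming a Spring-Boot-style /health payload:

```shell
#!/usr/bin/env bash
# Sketch of the 2-minute application-layer check.

health_verdict() {
  # Treat a /health body containing "UP" as healthy (assumed payload format).
  case "$1" in
    *UP*) echo "healthy" ;;
    *)    echo "unhealthy" ;;
  esac
}

app_check() {
  local svc="$1" port="$2"
  systemctl is-active --quiet "$svc" || echo "$svc: not active"
  pgrep -f "$svc" >/dev/null         || echo "$svc: no process found"
  ss -tuln | grep -q ":$port "       || echo "port $port: not listening"
  health_verdict "$(curl -s --max-time 3 "http://localhost:$port/health")"
}
```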

(4) Data Layer – 2 Minutes to Assess Database/Cache Health

MySQL login test: mysql -uUSER -p -hIP (failure = service down or bad credentials)

MySQL slow/blocked queries: mysql -e "SHOW FULL PROCESSLIST;" (a Locked state = blocked; Time > 60 s = slow query)

Redis connectivity: redis-cli -h IP ping ("PONG" = OK)

Redis slowlog: redis-cli -h IP slowlog get 5 (shows slow commands)

Emergency actions: kill the blocking query from the MySQL prompt (KILL THREAD_ID;) or temporarily raise the connection limit (SET GLOBAL max_connections = 2000;).
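A sketch of the data-layer sweep; HOST and USER are placeholders, and query_verdict is a hypothetical helper encoding the 60-second slow-query threshold from the text:

```shell
#!/usr/bin/env bash
# Sketch of the 2-minute data-layer check.

query_verdict() {
  # Classify a processlist entry by its State and Time columns.
  local state="$1" secs="$2"
  if   [ "$state" = "Locked" ]; then echo "BLOCKED"
  elif [ "$secs" -gt 60 ];      then echo "SLOW"
  else                               echo "OK"
  fi
}

db_check() {
  local host="$1" user="$2"
  mysql -u"$user" -p -h"$host" -e "SHOW FULL PROCESSLIST;" \
    || echo "mysql: login failed (service down or bad credentials)"
  redis-cli -h "$host" ping | grep -q PONG || echo "redis: no PONG"
  redis-cli -h "$host" slowlog get 5
}
```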

(5) External Dependency Layer – 1 Minute to Verify Third‑Party Services

Third‑party API: curl -I API_URL (timeout or 5xx = external failure)

DNS resolution: nslookup DOMAIN (error = hijack or misconfiguration)

CDN cache: curl -sI DOMAIN | grep -i cache (an X-Cache: HIT header = served from cache; a miss may require a cache refresh)

Emergency fix: switch to a backup API endpoint or use an alternative public DNS resolver (e.g., 223.5.5.5).
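The backup-endpoint fix can be sketched as a failover wrapper; both URLs are hypothetical, and is_external_failure is a helper named for this example:

```shell
#!/usr/bin/env bash
# Sketch of failing over to a backup third-party endpoint.

is_external_failure() {
  # curl's %{http_code} is 000 on a connect failure/timeout;
  # treat that and any 5xx as an external-service failure.
  case "$1" in
    000|5??) return 0 ;;
    *)       return 1 ;;
  esac
}

call_with_fallback() {
  local primary="$1" backup="$2" code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$primary") || code=000
  if is_external_failure "$code"; then
    echo "primary returned $code, falling back" >&2
    curl -s --max-time 5 "$backup"
  fi
}
```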

(6) Security Device Layer – 1 Minute to Check Blocking Rules

Firewall rules: iptables -L -n (look for DROP rules) and firewall-cmd --list-ports (confirm allowed ports)

WAF/IPS: consult the security team for block logs and add temporary whitelist entries for legitimate IPs/requests.
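A small sketch of the firewall check, assuming firewalld; the parsing helper port_allowed_in is separated out so it can be exercised against captured output:

```shell
#!/usr/bin/env bash
# Sketch of the 1-minute security-device check (assumes iptables + firewalld).

port_allowed_in() {
  # $1: output of `firewall-cmd --list-ports`, $2: port number to look for
  echo "$1" | tr ' ' '\n' | grep -qx "$2/tcp" && echo "allowed" || echo "blocked"
}

fw_check() {
  local port="$1"
  iptables -L INPUT -n | grep -i drop   # any DROP rules that could match traffic?
  port_allowed_in "$(firewall-cmd --list-ports)" "$port"
}
```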

3. 2‑Minute Information Collection for Post‑Mortem

User error screenshots (error code, timestamp)

Key excerpts from system/application logs

CPU, memory, disk, and network monitoring data

Status records of services, databases, and third‑party interfaces

4. Post‑Incident Actions (3 Steps)

Root‑cause review: identify whether the issue stemmed from resources, configuration, code, or third‑party services and assign responsibility.

Monitoring optimization: set alerts on core metrics (CPU, connection count, API latency) via SMS, DingTalk, etc.

Script consolidation: turn frequently used diagnostic commands into reusable scripts (e.g., a one‑click system‑resource check).
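A minimal sketch of such a one-click check, which snapshots the system-layer metrics into a timestamped file for the post-mortem; the output path is an assumption:

```shell
#!/usr/bin/env bash
# One-click resource snapshot for post-incident collection.
set -u

snapshot() {
  local out="${1:-/tmp/diag-$(date +%Y%m%d-%H%M%S).txt}"
  {
    echo "== uptime ==";  uptime
    echo "== memory ==";  free -h
    echo "== disk ==";    df -h
    echo "== top CPU =="; ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 6
  } > "$out" 2>&1
  echo "$out"    # print the path so callers can attach it to the incident record
}

file=$(snapshot)
echo "snapshot written to $file"
```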

Conclusion

The essence of outage troubleshooting is not memorizing commands but “defining the scope first, then breaking through layers.” Following this 15‑minute workflow lets you locate and resolve over 80% of incidents quickly. Save this guide and apply it next time a service goes down.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: operations, troubleshooting, system monitoring, network debugging, service recovery
Written by Xiao Liu Lab

An operations lab passionate about server tinkering 🔬 Sharing automation scripts, high-availability architecture, alert optimization, and incident reviews. Using technology to reduce overtime and experience to avoid major pitfalls. Follow me for easier, more reliable operations!
