
A Structured Approach to Online System Issue Diagnosis and Recovery

This article outlines a systematic methodology for understanding, evaluating, and quickly resolving production system incidents. It categorizes system layers, assesses impact, applies Linux diagnostic tools, and designs fault-tolerant mechanisms to minimize downtime and keep core functionality available.

Architecture Digest

Front‑line developers often face online incidents without a clear analysis process, leading to wasted time and potential revenue loss. This article summarizes practical experience and proposes a repeatable problem‑location workflow to help teams identify root causes faster.

What the article covers: A concise pattern for diagnosing production issues, focusing on system understanding, impact assessment, rapid recovery, and post‑mortem design.

What it does not cover: Detailed Linux command tutorials and exhaustive solutions for every possible case; the methods are most relevant to Java‑based web systems.

Understanding Your System

Identify whether a symptom qualifies as a system problem by considering the scale and normal operating metrics of your service. Systems can be divided into three layers:

System layer: hardware and network resources (CPU, memory, disk, I/O, deployment topology).

Software layer: runtime environment — load balancer, JDK version, web server (e.g., Tomcat), JVM parameters, databases, caches.

Application layer: business logic, API response times, QPS, concurrency limits.

Knowing these details determines how quickly you can spot anomalies before they affect users.
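As a concrete starting point, the normal operating metrics above can be snapshotted with standard Linux tools so that deviations stand out during an incident. This is a minimal sketch; the function name and output file are illustrative, not from the article.

```shell
#!/usr/bin/env bash
# Minimal baseline snapshot of the system-layer metrics listed above.
# The output file name is illustrative; adapt it to your environment.
baseline() {
  local out="${1:-baseline.txt}"
  {
    echo "== CPU cores ==";    nproc
    echo "== Memory (MB) ==";  free -m
    echo "== Disk usage ==";   df -h
    echo "== Load average =="; uptime
  } > "$out"
  echo "baseline written to $out"
}
```

Comparing a fresh snapshot against a baseline taken during normal operation makes anomalies much easier to spot.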

Evaluating Problem Impact

Determine the scope of affected users and whether the issue is global or isolated to a single node. Impact guides prioritization.

Common sources of incident information include:

System/business monitoring alerts: typically indicate serious incidents that can be reproduced.

Related system fault tracing: shows downstream effects and may uncover hidden dependencies.

Production incident reports (customer service): often stem from user complaints; reproducing the symptom is key.

Proactive discovery: monitoring or logs reveal transient anomalies that may not require immediate action.

Rapid Recovery

When a bug is confirmed, two broad recovery strategies exist:

Root cause cannot be located quickly: roll back to the previous version; restart services (e.g., for high CPU or connection spikes); scale out resources if load is excessive.

Root cause identified: apply a temporary workaround or degrade the affected functionality.

Regardless of method, preserve the incident context for later analysis. Typical diagnostic steps include:

Run top and sort by CPU usage (Shift+P) to find the most resource-intensive process.

Run free -m to check memory usage; if it is high, sort top by memory instead (Shift+M).

Inspect suspect processes with ps xuf | grep java to capture detailed information.

Collect thread dumps via jstack <PID> > jstack.log, repeating a few times so thread states can be compared across snapshots.

Check GC utilization with jstat -gcutil <PID>; if old-generation occupancy is near 100%, capture a heap snapshot with jmap -dump:live,format=b,file=heap.bin <PID>.
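The capture sequence above can be wrapped in one script so context is preserved consistently under pressure. This is a sketch for a Java process; the log file names are examples, and the jmap invocation uses the modern JDK dump syntax.

```shell
#!/usr/bin/env bash
# Sketch of the diagnostic capture sequence for one Java PID.
# Log file names are examples; run from a scratch directory.
capture_context() {
  local pid="$1"
  [ -n "$pid" ] || { echo "usage: capture_context <java-pid>" >&2; return 1; }
  top -b -n 1 -o %CPU | head -n 20 > top.log     # CPU-sorted process snapshot
  free -m                          > free.log    # memory headroom
  ps xuf | grep '[j]ava'           > ps.log      # suspect process details
  for i in 1 2 3; do                             # repeated thread dumps,
    jstack "$pid" >> jstack.log                  # spaced a few seconds apart
    sleep 2
  done
  jstat -gcutil "$pid" > gcutil.log              # GC occupancy
  jmap -dump:live,format=b,file=heap.bin "$pid"  # heap snapshot
}
```

Running this once before any restart or rollback preserves the incident context the article asks for.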

Location and Fix

Typical failure causes are illustrated in the following diagram:

Most incidents manifest as abnormal metrics in one or more system modules. Each module has associated diagnostic tools:

CPU: top -Hp

Memory: free -m

I/O: iostat

Disk: df -h

Network: netstat

GC: jstat -gcutil

Threads: jstack

Java heap: jmap

Auxiliary: MAT, BTrace, JProfiler

Further details on tool usage are omitted for brevity.
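One common combination of the tools above, not spelled out in the article, is tracing a CPU spike to a specific thread: top -Hp lists per-thread CPU, and the busiest thread's decimal ID, converted to hex, matches the nid field in a jstack dump. A sketch, with the function names being illustrative:

```shell
# Map the hottest thread of a Java process to its stack trace.
# jstack prints native thread ids in hex ("nid=0x..."), so the
# decimal TID reported by top must be converted first.
tid_to_nid() { printf '0x%x' "$1"; }

hot_thread() {
  local pid="$1"
  # first numeric row of the CPU-sorted, per-thread top output
  local tid
  tid=$(top -b -H -n 1 -p "$pid" | awk '$1 ~ /^[0-9]+$/ {print $1; exit}')
  jstack "$pid" | grep -A 20 "nid=$(tid_to_nid "$tid")"
}
```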

Methodology

With system layers and tools defined, the troubleshooting process can be abstracted into three steps:

Inspect each layer to confirm the observed symptom.

Locate the offending process based on the symptom.

Analyze thread and memory state to pinpoint the trigger.

The workflow is visualized below:

Designing for Failure

As systems grow, failures become inevitable. Instead of only reacting, design mechanisms to minimize loss and keep core functions available:

Reasonable timeout settings: abort slow third-party calls and internal RPCs to prevent request buildup.

Service degradation: switch to simpler implementations or return default values for non-critical APIs.

Proactive discard: drop excessively slow, non-essential third-party responses.

All degradation or discard strategies should be paired with appropriate retry logic to restore normal operation once the dependency recovers.
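At the shell level, the three mechanisms compose naturally into one call path. This sketch uses a hypothetical endpoint and default value: curl's --max-time acts as the hard timeout, bounded retries with backoff handle transient failures, and the default value is the degraded response.

```shell
# Timeout + retry + degradation in one call path.
# The URL and default value passed in are hypothetical placeholders.
fetch_with_fallback() {
  local url="$1" default="$2"
  local resp attempt
  for attempt in 1 2 3; do
    # hard 2-second timeout aborts slow dependencies; -f fails on HTTP errors
    if resp=$(curl -sf --max-time 2 "$url"); then
      echo "$resp"
      return 0
    fi
    sleep "$attempt"      # simple linear backoff between retries
  done
  echo "$default"         # degrade: serve a default instead of failing
}
```

For example, a non-critical price lookup might be called as fetch_with_fallback "https://pricing.internal/item/42" "0.00" (hypothetical endpoint), so the page renders even when the dependency is down.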

Author Bio

Sun Si, head of transaction systems at Zhuanzhuan, graduated from Beihang University's VR lab in 2008 and has extensive experience in large-scale e-commerce platforms and distributed system design.

Source: https://mp.weixin.qq.com/s/4HTW3BmW1HfGyVmomO1Wrw
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
