
System Troubleshooting: A Structured Approach to Diagnosis, Recovery, and Failure‑Resilient Design

This article presents a systematic methodology for diagnosing and resolving online system issues, covering system understanding, impact assessment, rapid recovery techniques, detailed troubleshooting steps with Linux and Java tools, and design principles to mitigate future failures.

Zhuanzhuan Tech

Preface

Front‑line developers often have to handle online incidents, but many colleagues lack a systematic way to analyze and solve these problems, leading to wasted time and potential loss of service availability.

What this article covers: It summarizes the author’s experience handling online issues and proposes a repeatable pattern for locating and handling problems, to help reduce investigation time and quickly identify the key points to fix.

What this article does not cover: It is not a Linux command tutorial (commands are only briefly introduced), and it does not aim to provide solutions for every possible issue; the methods mainly apply to typical Java‑based web systems.

Know Your System

Understanding the characteristics of your system is the prerequisite for fast problem discovery. Systems can be divided into three layers:

System layer: hardware and infrastructure, such as CPU, disk, memory, network I/O, deployment model (distributed or single‑machine), number of CPU cores, physical or virtual machines, memory size, disk capacity, and network card specifications.

Software layer: the software environment, including load balancers, JDK version, web server (e.g., Tomcat) and JVM parameters, and database and cache products.

Application layer: the application itself, e.g., average response time of key interfaces, QPS, and concurrency limits of specific endpoints.

Being able to answer questions about these aspects determines how well you know the system and how quickly you can spot problems before they cause real impact.
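As a quick self-check, the system-layer facts above can be gathered with a short script. This is a minimal sketch using standard Linux tools; exact output formats vary by distribution, and the ip command is only present where iproute2 is installed.

```shell
#!/bin/sh
# Collect basic system-layer facts (Linux; output formats vary by distro).
echo "cpu cores : $(nproc)"
echo "memory    : $(free -m | awk '/^Mem:/ {print $2 " MB total"}')"
echo "disk /    : $(df -h / | awk 'NR==2 {print $2 " total, " $5 " used"}')"
echo "kernel    : $(uname -r)"
ip -brief link 2>/dev/null | head -5   # network interfaces, if iproute2 exists
```

Knowing these numbers in advance (not during an incident) is what makes an anomaly recognizable as an anomaly.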

Assess Impact Scope

Determine how many users are affected and to what extent. For clustered systems, decide whether the issue is global or limited to a single node. Impact scope influences the priority of handling.

Information sources include:

System and business monitoring alerts: Large companies usually have monitoring systems; an alert means the system is already impacted and requires immediate attention.

Related system fault tracing: Identify whether the problem originates in another system; this often reveals hidden dependencies that need further investigation after the urgent fix.

Production incident reports (customer service): Issues reported by users; reproducing the phenomenon is crucial.

Proactive discovery: Use monitoring or logs to spot anomalies, then verify whether they are true problems or transient glitches.

Rapid Recovery

When a system bug is confirmed, the goal is to restore service as quickly as possible. Two scenarios exist:

1. Unable to locate the root cause quickly

Rollback: Preferable when a recent version was deployed.

Restart: Used when CPU spikes or connection counts surge.

Scale out: Applied when high traffic cannot be mitigated by restart.

2. Root cause can be identified

Temporary workaround or feature degradation.

Regardless of the method, the incident must be documented. Basic diagnostic steps include:

Run top; if CPU usage is high, sort by CPU (Shift+P) and record the most resource‑intensive process.

Run free -m; if memory usage is high, run top again, sort by memory (Shift+M), and record the top consumer.

Inspect the suspect process with ps xuf | grep java and record detailed information.

Collect thread dumps with jstack <PID> > jstack.log (repeat several times, a few seconds apart).

Check GC utilization with jstat -gcutil <PID>; if the Old generation approaches 100%, capture a class histogram with jmap -histo <PID> > jmap.log (or a full heap dump with jmap -dump:format=b,file=heap.hprof <PID>) for further analysis.
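The capture steps above can be bundled into one script run at incident time, so nothing is forgotten under pressure. This is a sketch: jstack and jstat are JDK tools assumed to be on PATH, the suspect Java PID is passed as the first argument, and file names are illustrative.

```shell
#!/bin/sh
# Sketch: snapshot system and JVM state at incident time for later analysis.
# jstack/jstat are JDK tools and must be on PATH.
capture() {
  PID="$1"
  TS=$(date +%Y%m%d-%H%M%S)
  top -b -n 1 | head -20      > "top-$TS.log"     # overall CPU/process view
  free -m                     > "free-$TS.log"    # memory overview
  ps uf -p "$PID"             > "ps-$TS.log"      # detail on the suspect process
  for i in 1 2 3; do                              # several thread dumps, spaced out
    jstack "$PID"            >> "jstack-$TS.log"
    sleep 2
  done
  jstat -gcutil "$PID" 1000 5 > "jstat-$TS.log"   # five GC samples, 1s apart
}

# Usage: ./capture.sh <java-pid>
if [ $# -ge 1 ]; then capture "$1"; fi
```

Capturing first and analyzing later keeps the door open for a restart or rollback without losing the evidence.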

Location and Fix

The following diagram shows common causes of system failures:

Most failures manifest as abnormal metrics in one or more modules. Each module has corresponding tools for analysis and localization.

Methodology

With the module division and associated tools, the troubleshooting process can be abstracted into a relatively fixed workflow:

Inspect each module sequentially to confirm the observed symptoms.

Based on the symptoms, locate the problematic process.

Further analyze threads and memory usage.

The ultimate goal is to find the trigger point of the issue.
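As a concrete instance of "locate the process, then analyze its threads": the busiest thread of a Java process can be mapped to a stack trace, because jstack prints each thread's native id in hex (nid=0x...). A sketch, with the PID passed as the first argument and JDK tools assumed on PATH:

```shell
#!/bin/sh
# Sketch: show the stack trace of the busiest thread in a Java process.
hot_thread() {
  PID="$1"
  # Per-thread view in batch mode; keep numeric rows, sort by %CPU (field 9).
  TID=$(top -H -p "$PID" -b -n 1 | awk '$1+0 > 0' | sort -k9 -nr \
        | head -1 | awk '{print $1}')
  HEX=$(printf '%x' "$TID")                 # jstack reports native ids in hex
  jstack "$PID" | grep -A 15 "nid=0x$HEX"   # frames of that thread
}

# Usage: ./hot-thread.sh <java-pid>
if [ $# -ge 1 ]; then hot_thread "$1"; fi
```

This single mapping, from a top symptom down to one stack frame, is the workflow above in miniature.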

Design for Faults and Failures

As system scale grows, failures become inevitable. Reactive measures are insufficient; designs must include mechanisms that minimize loss and keep core functions available when failures occur.

1. Reasonable timeout mechanisms

Set appropriate timeouts for third‑party network calls to avoid request pile‑up.

Configure timeouts for internal service calls.
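The effect of a hard deadline can be sketched with coreutils' timeout(1) standing in for a client-side timeout setting; here sleep 10 plays the role of a slow third-party call.

```shell
#!/bin/sh
# Without a deadline, the caller blocks for as long as the dependency is slow.
# A 2-second deadline bounds the damage and frees the calling thread.
if timeout 2 sleep 10; then
  echo "call succeeded"
else
  echo "deadline exceeded: fail fast instead of piling up requests"
fi
```

Failing fast after two seconds is what prevents one slow dependency from exhausting every worker thread upstream.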

2. Service degradation

Automatically switch to a simpler, degraded implementation when a service cannot respond normally.

Return default values for less critical interfaces.
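Returning a default for a non-critical interface can be sketched as a fallback around the failing call; get_recommendations here is a hypothetical stub simulating a dependency that is currently down.

```shell
#!/bin/sh
# Hypothetical non-core call; here it simulates a failing recommendation
# service by always returning a non-zero status.
get_recommendations() { return 1; }

# Degrade: if the call fails, serve a safe default instead of an error,
# so the core page still renders.
if result=$(get_recommendations); then
  echo "recommendations: $result"
else
  echo "recommendations: (default list)"
fi
```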

3. Proactive discard

For non‑core, slow third‑party calls, discard the request outright.

Whether a call is degraded or discarded, a proper retry mechanism should be in place so that the dependency is picked up again automatically once it becomes healthy.
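A minimal retry helper with backoff, sketched in shell; the attempt count and delays are illustrative, not prescriptive.

```shell
#!/bin/sh
# retry CMD...: run CMD, retrying up to 3 attempts with linear backoff.
retry() {
  n=1
  until "$@"; do
    [ "$n" -ge 3 ] && return 1   # give up after the 3rd attempt
    sleep "$n"                   # back off: 1s, then 2s
    n=$((n + 1))
  done
}

retry true  && echo "succeeded on a healthy dependency"
retry false || echo "gave up after retries"
```

In a real system the "give up" branch would feed back into the degradation or discard path above, and a periodic probe would restore normal traffic once the dependency answers again.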

Author Bio

Sun Si, head of the transaction system at Zhuanzhuan, graduated in 2008 from Beihang University’s Computer Science Virtual Reality Lab. After working at TravelSky on ticket publishing platforms, he joined the internet industry in 2010, contributing to NetEase’s e‑commerce platform, Sohu’s news client, and Qunar’s transaction and payment systems. Since April 2016, he has led the transaction system in Zhuanzhuan’s middle‑platform technology department, with deep expertise in large‑scale e‑commerce platforms and distributed system design.

Tags: Operations, performance monitoring, Incident Response, system troubleshooting, Java debugging, Linux tools
Written by

Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
