How to Diagnose and Fix Online System Issues Efficiently
This article shares practical methods that help frontline engineers quickly understand, assess, and resolve online system problems. It covers dividing the system into layers, evaluating impact, using essential Linux and JVM diagnostic tools, and applying systematic troubleshooting and design-for-failure strategies to minimize downtime.
Preface
Frontline engineers often face online incidents without a clear method for analyzing them, which wastes time and can lead to avoidable losses.
This article summarizes the author’s experience and proposes a structured problem‑location process to help identify key issues faster.
Understand Your System
Determine what constitutes a "system problem" by considering the scale and function of your service. Quickly discovering issues requires deep knowledge of the system, which can be divided into three layers:
System layer: hardware and network resources (CPU, disk, memory, network I/O), deployment model (distributed or single-machine), number of cores, physical or virtual machines, memory and disk sizes, NIC specifications.
Software layer: software environment such as load balancers, JDK version, web server (e.g., Tomcat) and JVM parameters, database and cache products.
Application layer: application-level metrics like average response time, QPS, and concurrent request limits.
How well you can answer questions about each of these layers shows how familiar you are with the system, and that familiarity determines how quickly you can spot problems before they cause impact.
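As a rough illustration of recording system-layer facts ahead of an incident, the sketch below collects a few of them with the Python standard library. The function name `system_layer_snapshot` and the choice of fields are assumptions made for this example; a real inventory would also cover memory, NIC details, and the deployment model.

```python
import os
import platform
import shutil


def system_layer_snapshot(path="/"):
    """Collect a few system-layer facts for the machine this runs on.

    Hypothetical helper: the field names are illustrative, not a standard.
    """
    usage = shutil.disk_usage(path)  # total, used, free in bytes
    gib = 1024 ** 3
    return {
        "hostname": platform.node(),
        "os": platform.platform(),
        "cpu_cores": os.cpu_count(),
        "disk_total_gib": round(usage.total / gib, 1),
        "disk_free_gib": round(usage.free / gib, 1),
    }
```

Keeping the result of such a snapshot next to the service's runbook gives a baseline to compare against when something looks wrong.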
Assess Impact Scope
Evaluate how many users are affected, the severity of the impact, and whether the issue is global or isolated to a single node. The source of problem information can be:
System/business monitoring alerts: these indicate serious incidents that need immediate attention and are usually reproducible.
Related system fault tracing: this helps prioritize based on the severity of dependent systems and may uncover hidden issues.
Production incident reports (customer reports): these require reproducing the reported symptom.
Proactive discovery: using monitoring or logs to detect anomalies that may not yet affect users.
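One way to make the "global or isolated" question concrete is to compare per-node error rates. The sketch below is only an illustration: the function `classify_scope`, the 5% error threshold, and the rule that half or more failing nodes counts as global are all assumptions, not values from the article.

```python
def classify_scope(error_rates, threshold=0.05, global_fraction=0.5):
    """Classify an incident as 'none', 'isolated', or 'global'.

    error_rates: mapping of node name -> error rate in [0, 1].
    threshold and global_fraction are illustrative defaults.
    """
    # Nodes whose error rate exceeds the threshold are treated as failing.
    failing = sorted(node for node, rate in error_rates.items() if rate > threshold)
    if not failing:
        return "none", failing
    if len(failing) / len(error_rates) >= global_fraction:
        return "global", failing
    return "isolated", failing
```

For example, `classify_scope({"node-a": 0.01, "node-b": 0.40, "node-c": 0.02, "node-d": 0.0})` flags only `node-b`, which suggests removing that node from rotation rather than rolling back the whole service.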
Rapid Recovery
If the issue is a system bug, immediate actions include:
Roll back to a previous version when a recent deployment caused the problem.
Restart services when CPU usage spikes or connections surge.
Scale out when load is high and a restart does not help.
If the root cause is known, consider temporary workarounds or feature degradation. All actions should aim for the fastest service restoration while preserving the incident context for later analysis.
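The recovery options above can be read as a simple decision order. The sketch below encodes that order under stated assumptions: the function `choose_recovery_action`, its three boolean inputs, and the returned labels are hypothetical names for this example, and a real decision would weigh more context.

```python
def choose_recovery_action(recent_deploy, resource_spike, restart_tried):
    """Pick a first recovery action, following the order described above.

    recent_deploy:  a deployment shortly before the incident is the suspect
    resource_spike: CPU usage, connection count, or load is abnormally high
    restart_tried:  a restart was already attempted without success
    """
    if recent_deploy:
        return "rollback"
    if resource_spike and not restart_tried:
        return "restart"
    if resource_spike and restart_tried:
        return "scale_out"
    # No quick fix applies: fall back to a workaround or feature degradation.
    return "degrade_or_investigate"
```

Whatever action is chosen, capture thread dumps and logs first where time allows, so the incident context survives the restart or rollback.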
Typical diagnostic commands:
top -Hp <PID> (per-thread CPU usage of a process)
free -m (memory usage in MiB)
ps xuf | grep java (process details)
jstack <PID> > jstack.log (thread dump)
jstat -gcutil <PID> (GC utilization)
jmap -dump:format=b,file=heap.hprof <PID> (heap dump)
iostat (disk I/O)
df -h (disk space)
netstat (network connections)
MAT, BTrace, JProfiler (advanced analysis)
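A common way to combine two of these tools is to take the busiest thread ID shown by top -Hp and look it up in the jstack output, where each thread header carries a native ID in a nid= field. The sketch below assumes the thread dump was saved as text, that thread blocks are separated by blank lines, and that nid is printed either as hex (for example nid=0x1a2b, as many JDK versions do) or as a plain decimal number; the function name `find_thread` is hypothetical.

```python
import re

# Matches a jstack thread header such as:
#   "main" #1 prio=5 os_prio=0 tid=0x... nid=0x1a2b runnable [0x...]
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]*)".*?\bnid=(?P<nid>0x[0-9a-fA-F]+|\d+)')


def find_thread(jstack_text, os_thread_id):
    """Return the stack block whose nid equals os_thread_id, or None.

    os_thread_id is the decimal thread ID as shown by top -Hp.
    """
    for block in jstack_text.split("\n\n"):
        lines = block.strip().splitlines()
        if not lines:
            continue
        match = THREAD_HEADER.match(lines[0])
        # int(..., 0) accepts both "0x1a2b" (hex) and "6699" (decimal).
        if match and int(match.group("nid"), 0) == os_thread_id:
            return block.strip()
    return None
```

With the dump from jstack read into a string, `find_thread(dump, 6699)` would return the stack of the thread that top -Hp listed as 6699, which is the thread to inspect for a busy loop or lock contention.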
Methodology
With the system divided into modules and corresponding tools, the troubleshooting process can be abstracted as:
Inspect each module to confirm the observed symptom.
Locate the problematic process based on the symptom.
Analyze threads and memory of the identified process.
This leads to the root trigger of the issue.
Design for Failures and Outages
As systems grow, failures become inevitable. Instead of only reacting, design mechanisms to minimize loss and keep core functionality available during failures.
Key design practices:
Reasonable timeout mechanisms: Set appropriate timeouts for third-party calls and internal service interactions to avoid request pile-up.
Service degradation: Automatically switch to a lower-functionality implementation or return default values for non-critical interfaces.
Proactive discard: Drop excessively slow third-party calls that are not core, while ensuring retry mechanisms for eventual recovery.
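The three practices above can be combined in a small wrapper: bound the wait with a timeout, and return a default value when the call is too slow or fails. This is a minimal sketch using Python's concurrent.futures, not a production pattern; the names `call_with_fallback` and `_pool`, the pool size, and the timeout values are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for non-critical outbound calls (size is illustrative).
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(func, timeout_s, default):
    """Run func() but wait at most timeout_s seconds for its result.

    On timeout or error, return default instead (service degradation).
    """
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Stop waiting. cancel() only prevents a task that has not started;
        # a task that is already running keeps going in the background.
        future.cancel()
        return default
    except Exception:
        # A failed non-critical call degrades to the default as well.
        return default
```

Note that this only discards the wait, not the work: a slow call still occupies a pool thread until it finishes, so the underlying client should also have its own connect and read timeouts. Retries for eventual recovery are left out of the sketch.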
Figure 1 – Common causes of system failures.
Figure 2 – Typical Linux troubleshooting tools.
Figure 3 – Step‑by‑step investigation to isolate the problematic process.
Figure 4 – Detailed analysis of the target process.
21CTO
