
How to Diagnose and Fix Online System Issues Efficiently

This article shares practical methods for frontline engineers to quickly understand, assess, and resolve online system problems by categorizing system layers, evaluating impact, using essential Linux monitoring tools, and applying systematic troubleshooting and design‑for‑failure strategies to minimize downtime.

21CTO

Preface

Front-line engineers often face online incidents without a clear method of analysis, which wastes time and can increase the damage an incident causes.

This article summarizes the author’s experience and proposes a structured problem‑location process to help identify key issues faster.

Understand Your System

Determine what constitutes a "system problem" by considering the scale and function of your service. Quickly discovering issues requires deep knowledge of the system, which can be divided into three layers:

System layer: hardware and network resources (CPU, disk, memory, network I/O), deployment model (distributed or single-machine), number of cores, physical or virtual machines, memory and disk sizes, NIC specifications.

Software layer: the software environment, such as load balancers, JDK version, web server (e.g., Tomcat) and JVM parameters, database and cache products.

Application layer: application-level metrics such as average response time, QPS, and concurrent request limits.

How well you can answer questions about each of these layers shows how familiar you are with the system, and that familiarity determines how quickly you can spot problems before they cause impact.

Assess Impact Scope

Evaluate how many users are affected, the severity of the impact, and whether the issue is global or isolated to a single node. The source of problem information can be:

System/business monitoring alerts: these indicate serious incidents that need immediate attention and are usually reproducible.

Related system fault tracing: this helps prioritize based on the severity of the dependent systems and may uncover hidden issues.

Production incident reports (customer reports): these require reproducing the reported symptom.

Proactive discovery: using monitoring or logs to detect anomalies that may not yet affect users.
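As a rough illustration of judging whether an issue is global or isolated to a single node, the sketch below classifies per-node error rates. The function name `classify_scope`, the input shape, and the 5% threshold are assumptions made for this example, not something the article prescribes.

```python
from typing import Dict, List, Tuple


def classify_scope(
    errors_by_node: Dict[str, Tuple[int, int]],
    threshold: float = 0.05,  # hypothetical error-rate threshold
) -> Tuple[str, List[str]]:
    """Classify impact scope from per-node (failed, total) request counts."""
    unhealthy = sorted(
        node
        for node, (failed, total) in errors_by_node.items()
        if total > 0 and failed / total > threshold
    )
    if not unhealthy:
        return "healthy", unhealthy
    if len(unhealthy) == len(errors_by_node):
        return "global", unhealthy      # every node is affected
    if len(unhealthy) == 1:
        return "isolated", unhealthy    # a single node is affected
    return "partial", unhealthy         # several nodes, but not all
```

An "isolated" result points toward removing one node from rotation, while a "global" result suggests looking at a shared dependency or a recent release.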

Rapid Recovery

If the issue is a system bug, immediate actions include:

Roll back to the previous version when a recent deployment caused the problem.

Restart services when CPU usage spikes or connections surge.

Scale out when load is high and a restart does not help.

If the root cause is known, consider temporary workarounds or feature degradation. All actions should aim for the fastest service restoration while preserving the incident context for later analysis.
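One way to preserve the incident context before a restart is to save the output of a few diagnostic commands into a timestamped directory. The following is a minimal sketch under that assumption; `capture_snapshot` and `jvm_commands` are hypothetical helper names, and the command set should be adapted to the tools actually installed on the host.

```python
import subprocess
import time
from pathlib import Path
from typing import Dict, List


def jvm_commands(pid: int) -> Dict[str, List[str]]:
    """Example command set for a Java process (assumes JDK tools are on PATH)."""
    return {
        "jstack": ["jstack", str(pid)],
        "jstat": ["jstat", "-gcutil", str(pid)],
        "top": ["top", "-H", "-b", "-n", "1", "-p", str(pid)],
        "free": ["free", "-m"],
        "df": ["df", "-h"],
    }


def capture_snapshot(
    commands: Dict[str, List[str]], out_dir: str, timeout_s: float = 30.0
) -> Dict[str, str]:
    """Run each command and write its output to <out_dir>/<timestamp>/<name>.log."""
    target = Path(out_dir) / time.strftime("%Y%m%d-%H%M%S")
    target.mkdir(parents=True, exist_ok=True)
    results: Dict[str, str] = {}
    for name, argv in commands.items():
        log_path = target / f"{name}.log"
        try:
            proc = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout_s
            )
            log_path.write_text(proc.stdout + proc.stderr)
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing or hung tool must not delay the restart itself.
            log_path.write_text(f"capture failed: {exc}\n")
        results[name] = str(log_path)
    return results
```

Calling `capture_snapshot(jvm_commands(pid), "/tmp/incident")` just before the restart leaves a set of log files to analyze once service is restored; the output path here is only an example.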

Typical diagnostic commands:

top -Hp <PID> (per-thread CPU usage of a process)

free -m (memory usage in MiB)

ps xuf | grep java (process details)

jstack <PID> > jstack.log (thread dump)

jstat -gcutil <PID> (GC utilization)

jmap -dump:format=b,file=heap.hprof <PID> (heap dump)

iostat (disk I/O)

df -h (disk space)

netstat (network connections)

MAT, BTrace, JProfiler (advanced analysis)

Methodology

With the system divided into modules and corresponding tools, the troubleshooting process can be abstracted as:

Inspect each module to confirm the observed symptom.

Locate the problematic process based on the symptom.

Analyze threads and memory of the identified process.

Following these steps in order narrows the search from the whole system down to the root cause of the issue.
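A common way to connect the last two steps is to take the ID of the busiest thread reported by `top -Hp <PID>` and look it up in the `jstack` output. The sketch below assumes the dump prints each thread's native ID as a hexadecimal `nid=0x...` field and separates thread entries with blank lines, as HotSpot's `jstack` traditionally does; `find_thread_stack` is a hypothetical helper name.

```python
import re
from typing import Optional


def find_thread_stack(jstack_text: str, os_thread_id: int) -> Optional[str]:
    """Return the jstack entry whose nid matches a decimal OS thread ID."""
    nid = format(os_thread_id, "x")  # top shows decimal, jstack shows hex
    pattern = re.compile(r"\bnid=0x0*" + nid + r"\b", re.IGNORECASE)
    # Thread entries are assumed to be separated by blank lines.
    for block in re.split(r"\n\s*\n", jstack_text):
        if pattern.search(block):
            return block.strip()
    return None
```

For example, a thread shown by `top` with ID 6699 corresponds to `nid=0x1a2b` in the dump, and the returned stack shows what that thread was executing when the dump was taken.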

Design for Failures and Outages

As systems grow, failures become inevitable. Instead of only reacting, design mechanisms to minimize loss and keep core functionality available during failures.

Key design practices:

Reasonable timeout mechanisms: Set appropriate timeouts for third-party calls and internal service interactions to avoid request pile-up.

Service degradation: Automatically switch to a lower-functionality implementation or return default values for non-critical interfaces.

Proactive discard: Drop excessively slow third-party calls that are not core, while ensuring retry mechanisms for eventual recovery.
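The timeout and degradation practices above can be combined in a small wrapper: wait for a non-critical call only up to a deadline, and return a default value if it is slow or fails. This is a minimal sketch, not a production pattern; `call_with_fallback`, `slow_recommendations`, and the pool size are assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Shared worker pool; the size is an arbitrary example value.
_pool = ThreadPoolExecutor(max_workers=8)


def call_with_fallback(fn: Callable[[], T], timeout_s: float, fallback: T) -> T:
    """Run fn, but give up waiting after timeout_s and return fallback instead."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only stops a task that has not started; a running call
        # keeps going in the background, so the underlying client should
        # also set its own connect/read timeouts.
        future.cancel()
        return fallback
    except Exception:
        # Degrade to the default for any failure of a non-critical call.
        return fallback


def slow_recommendations() -> List[str]:
    """Hypothetical non-core third-party call that responds too slowly."""
    time.sleep(0.5)
    return ["item-1", "item-2"]


# The page still renders, just without recommendations.
items = call_with_fallback(slow_recommendations, timeout_s=0.05, fallback=[])
```

The caller gets an answer within the deadline either way, which keeps slow dependencies from holding request threads and piling up work behind them.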

Figure 1 – Common causes of system failures.

Figure 2 – Typical Linux troubleshooting tools.

Figure 3 – Step‑by‑step investigation to isolate the problematic process.

Figure 4 – Detailed analysis of the target process.
