
How Big Tech Analyzes System Performance: The Proven RESAR 7‑Step Method

The article presents the RESAR seven‑step performance analysis method used by large tech companies, detailing how to build a performance‑analysis decision tree, collect and correlate system counters, and combine global and targeted monitoring to uncover bottleneck evidence chains with concrete Linux commands and diagrams.

JavaEdge

Apr 2, 2023

RESAR performance‑analysis methodology

RESAR (Reliability‑Engineering‑System‑Analysis‑Review) is a seven‑step method for any performance‑related case. It relies on two core concepts: a performance‑analysis decision tree and a performance‑bottleneck evidence chain.

1. Build the performance‑analysis decision tree

The decision tree is a hierarchical map of Component → Module → Counter. It is used both when designing monitoring and when diagnosing bottlenecks.
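As a minimal sketch, the tree could be held in code as a nested dictionary. The layout, the component name "os", and the grouping of counters under each module are illustrative assumptions; the counter names themselves come from the top and vmstat output shown in this article.

```python
# Decision tree sketch: Component -> Module -> Counter.
# The structure and grouping are assumptions for illustration, not a fixed schema.
DECISION_TREE = {
    "os": {
        "cpu": ["us", "sy", "ni", "id", "wa", "hi", "si", "st", "load average"],
        "memory": ["free", "buff", "cache"],
        "swap": ["si", "so"],
        "io": ["bi", "bo"],
    },
}


def iter_counters(tree):
    """Yield (component, module, counter) for every leaf in the tree."""
    for component, modules in tree.items():
        for module, counters in modules.items():
            for counter in counters:
                yield component, module, counter


if __name__ == "__main__":
    for path in iter_counters(DECISION_TREE):
        print(" -> ".join(path))
```

Keeping the tree as data means the same structure can drive both the monitoring checklist and the drill-down order during diagnosis.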

1.1 Steps to construct the tree

List all system components based on the architecture.

For each component, identify important modules (e.g., OS: CPU, memory, swap, I/O, etc.).

Enumerate the performance counters for each module. Example for the CPU module:

# top
top - 00:38:51 up 28 days, 4:27, 3 users, load average: 78.07, 62.23, 39.14
%Cpu0  : 4.2 us, 95.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.4 st
%Cpu1  : 1.8 us, 98.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

The counters are us, sy, ni, id, wa, hi, si, st and load average.
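To collect these counters programmatically, one option is to parse the per-CPU lines that top prints. The sketch below assumes the "value name" layout shown above (for example "95.4 sy"); other top versions or locales may format the line differently, and the function name is hypothetical.

```python
import re

# Matches pairs such as "95.4 sy" in a per-CPU line printed by top.
_PAIR = re.compile(r"(\d+(?:\.\d+)?)\s+([a-z]{2})")


def parse_top_cpu_line(line):
    """Return (cpu_label, {counter: percent}) for one '%CpuN : ...' line."""
    label, _, rest = line.partition(":")
    counters = {name: float(value) for value, name in _PAIR.findall(rest)}
    return label.strip().lstrip("%"), counters


if __name__ == "__main__":
    sample = "%Cpu0  : 4.2 us, 95.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.4 st"
    label, counters = parse_top_cpu_line(sample)
    print(label, counters)
```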

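Load average is related to the r and b columns of vmstat: on Linux it is a damped average of the number of runnable (r) and uninterruptible (b) processes, which vmstat reports per sample: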

# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0     0 5067124 178136 1701100    0    0     0     5    2   10  0  0 100  0  0
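The r and b columns can be pulled out of that output with a few lines of code. This is a sketch that assumes the two-line header layout shown above; the function name is hypothetical. Note that r + b is an instantaneous sample, whereas load average smooths such samples over 1, 5 and 15 minutes.

```python
def runnable_and_blocked(vmstat_output):
    """Return (r, b) from the last data row of vmstat output."""
    lines = [ln for ln in vmstat_output.splitlines() if ln.strip()]
    header = lines[1].split()   # second line holds the column names
    values = lines[-1].split()  # last line holds the newest sample
    row = dict(zip(header, values))
    return int(row["r"]), int(row["b"])


if __name__ == "__main__":
    sample = (
        "procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----\n"
        " r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st\n"
        " 1  0     0 5067124 178136 1701100    0    0     0     5    2   10  0  0 100  0  0\n"
    )
    r, b = runnable_and_blocked(sample)
    print("runnable:", r, "uninterruptible:", b, "r + b:", r + b)
```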

Additional CPU counters (%guest, %gnice) are available via mpstat:

# mpstat -P ALL 2
Linux 3.10.0-1062.4.1.el7.x86_64 (host) 04/02/2023 _x86_64_ (4 CPU)
14:00:36  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
14:00:38  all    5.13    0.00    3.21    0.00    0.00    0.26    0.00    0.00    0.00   91.40

Draw relationships between counters to understand how they influence each other (e.g., CPU usage ↔ load average ↔ runnable processes).
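One way to make such a relationship concrete is a small rule that reads load average together with CPU idle and I/O wait. The sketch below is only a heuristic: the function name and both thresholds (5% idle, 20% I/O wait) are assumptions chosen for illustration, and the CPU count of 2 in the example is assumed because the top snapshot above shows only Cpu0 and Cpu1.

```python
def describe_load(load1, cpu_count, idle_pct, iowait_pct):
    """Relate the 1-minute load average to CPU counters (heuristic sketch)."""
    per_cpu = load1 / cpu_count
    if per_cpu <= 1.0:
        return "load within CPU capacity"
    if idle_pct < 5.0:       # assumed threshold
        return "CPU saturated: tasks are queuing for CPU time"
    if iowait_pct > 20.0:    # assumed threshold
        return "high load with I/O wait: check uninterruptible tasks"
    return "high load but CPU not saturated: inspect blocked tasks"


if __name__ == "__main__":
    # load average 78.07 and 0.0 id come from the top snapshot above;
    # cpu_count=2 is an assumption.
    print(describe_load(78.07, cpu_count=2, idle_pct=0.0, iowait_pct=0.0))
```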

2. Build the performance‑bottleneck evidence chain

The evidence chain records the logical steps from an observed metric to the root‑cause, providing a traceable diagnosis.
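An evidence chain can be recorded as an ordered list of observations, each with its source and the inference drawn from it. The classes below are a hypothetical sketch of such a record, filled in with the soft-interrupt case analyzed later in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source: str       # command or file that produced the observation
    observation: str  # what was seen
    inference: str    # what it suggests, i.e. where to look next


@dataclass
class EvidenceChain:
    symptom: str
    steps: list = field(default_factory=list)

    def add(self, source, observation, inference):
        """Append one link to the chain and return self for chaining."""
        self.steps.append(Evidence(source, observation, inference))
        return self

    def render(self):
        """Return the chain as numbered, human-readable lines."""
        lines = [f"Symptom: {self.symptom}"]
        for i, s in enumerate(self.steps, 1):
            lines.append(f"{i}. [{s.source}] {s.observation} => {s.inference}")
        return "\n".join(lines)


if __name__ == "__main__":
    chain = EvidenceChain("high %si on CPU 22")
    chain.add("top", "%si is 21.4% on CPU 22",
              "soft-interrupt work is concentrated on one CPU")
    chain.add("/proc/softirqs", "NET_RX dominates on CPU 22",
              "network receive processing is the source")
    chain.add("/sys/class/net/<iface>/queues/", "single receive queue",
              "increase the number of receive queues")
    print(chain.render())
```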

3. Global monitoring analysis example

On a 24‑CPU host, a top snapshot shows most CPUs with low %us, but CPU 22 has a high %si (soft‑interrupt) value of 21.4%.

This anomaly suggests that soft‑interrupt processing is concentrated on a single CPU, prompting a deeper investigation.
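Spotting this kind of outlier can be automated by comparing each CPU's %si with the median across CPUs. The sketch below is an assumption-laden example: the function name, the ratio of 5 and the floor of 5% are arbitrary, and every value except CPU 22's 21.4% is made up for illustration.

```python
from statistics import median


def softirq_hotspots(si_by_cpu, ratio=5.0, floor=5.0):
    """Return CPUs whose %si is at least `floor` and `ratio` times the median."""
    base = median(si_by_cpu.values())
    return sorted(
        cpu for cpu, si in si_by_cpu.items()
        if si >= floor and si >= ratio * max(base, 0.1)
    )


if __name__ == "__main__":
    # 24 CPUs; 0.3% is a made-up baseline, 21.4% on CPU 22 is from the article.
    sample = {cpu: 0.3 for cpu in range(24)}
    sample[22] = 21.4
    print(softirq_hotspots(sample))
```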

4. Targeted monitoring analysis

Inspect /proc/softirqs and compare the per‑CPU columns. The NET_RX row (network receive) accounts for most of the soft‑interrupt count on CPU 22.
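The per-CPU comparison can be scripted. The sketch below parses text in the /proc/softirqs layout (a header row of CPU names followed by "NAME: count count ..." rows); the function names and the sample numbers are made up for illustration. The counts in /proc/softirqs are cumulative since boot, so for a rate you would sample twice and subtract.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs text into {softirq_name: [count per CPU]}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        counts = [int(x) for x in rest.split()]
        table[name.strip()] = counts[:cpu_count]
    return table


def busiest_cpu(table, name="NET_RX"):
    """Return (cpu_index, count) of the CPU with the highest count for `name`."""
    counts = table[name]
    cpu = max(range(len(counts)), key=counts.__getitem__)
    return cpu, counts[cpu]


# Made-up sample in the same layout; on a real host, read /proc/softirqs instead.
SAMPLE = """\
                    CPU0       CPU1       CPU2
          HI:          1          0          0
      NET_TX:         10         12          9
      NET_RX:        120        150     987654
"""

if __name__ == "__main__":
    print(busiest_cpu(parse_softirqs(SAMPLE), "NET_RX"))
```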

Check the receive queue count via /sys/class/net/<iface>/queues/, where each receive queue appears as an rx-<n> entry. With a single queue, all receive interrupts for that interface are handled by one CPU.
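Counting the receive queues can be done by listing that directory and counting the rx-* entries. A minimal sketch, with a hypothetical function name and a `sysfs_root` parameter added only so the function can be exercised against a test directory:

```python
import os


def count_rx_queues(iface, sysfs_root="/sys/class/net"):
    """Count the rx-* entries under <sysfs_root>/<iface>/queues."""
    queues_dir = os.path.join(sysfs_root, iface, "queues")
    return sum(1 for name in os.listdir(queues_dir) if name.startswith("rx-"))


if __name__ == "__main__":
    root = "/sys/class/net"
    if os.path.isdir(root):  # only present on Linux hosts
        for iface in sorted(os.listdir(root)):
            try:
                print(iface, count_rx_queues(iface))
            except OSError:
                pass  # entry without a queues directory
```

An interface that reports 1 here is a candidate for the single-queue bottleneck described above.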

Solution: increase the number of receive queues so that interrupt handling can be distributed across multiple CPUs. For virtual machines, add a multiqueue setting in the KVM XML configuration; for bare‑metal hosts, replace the NIC with a model that supports multiple queues.
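For the KVM case, the sketch below shows one way the multiqueue setting might be added, assuming a libvirt-managed guest with a virtio network interface, where the queue count is set with a `queues` attribute on the interface's `<driver>` element. The sample XML, the function name and the queue count of 4 are illustrative assumptions; check the libvirt documentation for your version before applying it.

```python
import xml.etree.ElementTree as ET


def enable_multiqueue(interface_xml, queues):
    """Return the interface XML with <driver name='vhost' queues='N'/> set."""
    iface = ET.fromstring(interface_xml)
    driver = iface.find("driver")
    if driver is None:
        driver = ET.SubElement(iface, "driver")
    driver.set("name", "vhost")
    driver.set("queues", str(queues))
    return ET.tostring(iface, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical <interface> element from a libvirt domain definition.
    sample = (
        "<interface type='network'>"
        "<source network='default'/>"
        "<model type='virtio'/>"
        "</interface>"
    )
    print(enable_multiqueue(sample, queues=4))
```

Depending on the guest, the additional queues may also need to be enabled inside the VM (for example with `ethtool -L <iface> combined 4`) before they are used.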

5. Data‑collection considerations

Monitoring tools rarely cover every counter. For Linux, a common stack is Prometheus + Grafana + node_exporter, which captures most system metrics but may miss network‑queue or memory‑error counters. Before relying on a tool, compare the counters required by the decision tree with those exposed by the tool; supplement missing data with direct commands such as top, vmstat, mpstat, cat /proc/softirqs, and ls /sys/class/net/.../queues.
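The comparison between required and exposed counters is a set difference. The sketch below uses made-up counter names for both sides; in practice the required list would come from your decision tree and the exposed list from the monitoring tool's metric catalogue.

```python
def missing_counters(required, exposed):
    """Return counters the decision tree needs but the tool does not expose."""
    return sorted(set(required) - set(exposed))


if __name__ == "__main__":
    # Both lists are illustrative assumptions, not a real tool's metric set.
    required = ["cpu.us", "cpu.sy", "cpu.si", "softirq.NET_RX", "net.rx_queues"]
    exposed = ["cpu.us", "cpu.sy", "cpu.si"]
    for name in missing_counters(required, exposed):
        print("collect manually:", name)
```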

6. RESAR seven‑step summary

Construct the performance‑analysis decision tree (Component → Module → Counter).

Collect real‑time values for all counters in the tree.

Perform global monitoring to spot the first‑level symptom.

Conduct targeted monitoring to drill down to the responsible module.

Record each logical step in an evidence chain.

Iterate to deeper levels if the root cause remains unclear.

Apply remediation based on the evidence chain (e.g., increase network receive queues).

Following these steps keeps the performance‑analysis process systematic, traceable, and repeatable.


