How Big Tech Analyzes System Performance: The Proven RESAR 7‑Step Method
The article presents the RESAR seven‑step performance analysis method used by large tech companies, detailing how to build a performance‑analysis decision tree, collect and correlate system counters, and combine global and targeted monitoring to uncover bottleneck evidence chains with concrete Linux commands and diagrams.
RESAR performance‑analysis methodology
RESAR (Reliability‑Engineering‑System‑Analysis‑Review) is a seven‑step method for any performance‑related case. It relies on two core concepts: a performance‑analysis decision tree and a performance‑bottleneck evidence chain.
1. Build the performance‑analysis decision tree
The decision tree is a hierarchical map of Component → Module → Counter. It is used both when designing monitoring and when diagnosing bottlenecks.
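One way to picture the Component → Module → Counter hierarchy is a nested mapping. The following is a minimal Python sketch, not the article's own tooling: the names `DECISION_TREE` and `iter_counters` are hypothetical, and only the CPU counters are taken from the article, with the other modules left empty as placeholders.

```python
# Hypothetical decision tree: component -> module -> list of counters.
# Only the OS/CPU counters come from the article; the empty lists are
# placeholders to be filled in for a real system.
DECISION_TREE = {
    "OS": {
        "CPU": ["us", "sy", "ni", "id", "wa", "hi", "si", "st", "load average"],
        "memory": [],
        "swap": [],
        "I/O": [],
    },
}


def iter_counters(tree):
    """Yield (component, module, counter) triples from the nested mapping."""
    for component, modules in tree.items():
        for module, counters in modules.items():
            for counter in counters:
                yield component, module, counter


paths = list(iter_counters(DECISION_TREE))
for component, module, counter in paths:
    print(f"{component} -> {module} -> {counter}")
```

Flattening the tree into triples like this gives a checklist that can be reused both for monitoring design and for diagnosis.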
1.1 Steps to construct the tree
List all system components based on the architecture.
For each component, identify important modules (e.g., OS: CPU, memory, swap, I/O, etc.).
Enumerate the performance counters for each module. Example for the CPU module:
# top
top - 00:38:51 up 28 days, 4:27, 3 users, load average: 78.07, 62.23, 39.14
%Cpu0 : 4.2 us, 95.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.4 st
%Cpu1 : 1.8 us, 98.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
The counters are us, sy, ni, id, wa, hi, si, st, and load average.
Load average reflects the number of runnable plus uninterruptible processes, which vmstat reports in its r and b columns:
# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 5067124 178136 1701100 0 0 0 5 2 10 0 0 100 0 0
Additional CPU counters (%guest, %gnice) are available via mpstat:
# mpstat -P ALL 2
Linux 3.10.0-1062.4.1.el7.x86_64 (host) 04/02/2023 _x86_64_ (4 CPU)
14:00:36 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
14:00:38 all 5.13 0.00 3.21 0.00 0.00 0.26 0.00 0.00 0.00 91.40
Draw relationships between counters to understand how they influence each other (e.g., CPU usage ↔ load average ↔ runnable processes).
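To fill the tree with real values, the per‑CPU lines printed by top can be turned into named counters. This is a minimal Python sketch that assumes the line layout of the top sample in this section; the function name `parse_top_cpu_lines` is hypothetical, and other top versions or configurations may format these lines differently.

```python
import re

# Matches lines such as "%Cpu0 : 4.2 us, 95.4 sy, ..." (layout assumed
# from the top sample in this article).
CPU_LINE = re.compile(r"^%Cpu(\d+)\s*:\s*(.*)$")


def parse_top_cpu_lines(text):
    """Return {cpu_index: {counter_name: value}} for top's per-CPU lines."""
    result = {}
    for line in text.splitlines():
        match = CPU_LINE.match(line.strip())
        if not match:
            continue  # skip the summary line and anything else
        counters = {}
        for field in match.group(2).split(","):
            value, name = field.split()
            counters[name] = float(value)
        result[int(match.group(1))] = counters
    return result


SAMPLE = """\
top - 00:38:51 up 28 days, 4:27, 3 users, load average: 78.07, 62.23, 39.14
%Cpu0 : 4.2 us, 95.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.4 st
%Cpu1 : 1.8 us, 98.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
"""

cpus = parse_top_cpu_lines(SAMPLE)
print(cpus[0]["sy"], cpus[1]["sy"])
```

In this sample both CPUs spend almost all their time in sy, which is the kind of counter relationship (high system time alongside a high load average) the tree is meant to make visible.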
2. Build the performance‑bottleneck evidence chain
The evidence chain records the logical steps from an observed metric to the root‑cause, providing a traceable diagnosis.
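An evidence chain can be kept as an ordered list of observations, each tied to the command or file that produced it. The sketch below is one possible representation in Python, not a format prescribed by RESAR; the class names `Evidence` and `EvidenceChain` are hypothetical, and the three sample steps restate the soft‑interrupt case analyzed later in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    observation: str  # what was seen
    source: str       # command or file that showed it


@dataclass
class EvidenceChain:
    steps: list = field(default_factory=list)

    def add(self, observation, source):
        """Append one step and return self so calls can be chained."""
        self.steps.append(Evidence(observation, source))
        return self

    def render(self):
        """Return the chain as numbered lines, from symptom to cause."""
        return "\n".join(
            f"{i}. {step.observation} [{step.source}]"
            for i, step in enumerate(self.steps, 1)
        )


chain = (
    EvidenceChain()
    .add("CPU 22 shows a high si value of 21.4%", "top")
    .add("NET_RX dominates soft-interrupt counts on CPU 22", "/proc/softirqs")
    .add("The interface has a single receive queue", "/sys/class/net/<iface>/queues/")
)
print(chain.render())
```

Recording the source next to each observation is what makes the diagnosis traceable: anyone can rerun the same command and check the step.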
3. Global monitoring analysis example
On a 24‑CPU host, a top snapshot shows most CPUs with low %us, but CPU 22 has a high %si (soft‑interrupt) value of 21.4%.
This anomaly suggests that soft‑interrupt processing is concentrated on a single CPU, prompting a deeper investigation.
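Spotting this kind of imbalance can be automated by comparing each CPU's soft‑interrupt share with the median across CPUs. The following Python sketch is only an illustration: the function name `find_softirq_hotspots` and the `factor` and `floor` thresholds are assumptions, and every sample value except CPU 22's 21.4% is made up.

```python
from statistics import median


def find_softirq_hotspots(si_by_cpu, factor=5.0, floor=5.0):
    """Return CPUs whose %si is at least `floor` and `factor` times the median.

    Both thresholds are arbitrary illustrative defaults, not tuned values.
    """
    baseline = median(si_by_cpu.values())
    return sorted(
        cpu
        for cpu, si in si_by_cpu.items()
        if si >= floor and si >= factor * baseline
    )


# Hypothetical snapshot of a 24-CPU host: only the 21.4 on CPU 22 comes
# from the article; the 0.3 values are placeholders.
si_by_cpu = {cpu: 0.3 for cpu in range(24)}
si_by_cpu[22] = 21.4

hotspots = find_softirq_hotspots(si_by_cpu)
print(hotspots)
```

Using the median as the baseline keeps a single overloaded CPU from distorting the comparison, which a mean would not.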
4. Targeted monitoring analysis
Inspect /proc/softirqs and filter the output for CPU 22. NET_RX (network receive) dominates the soft‑interrupt counts on that CPU.
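The filtering step can be sketched in Python by parsing the /proc/softirqs table, which has one column per CPU and one row per soft‑interrupt type. The function names `parse_softirqs` and `dominant_softirq` are hypothetical and the sample counts are invented for illustration. The counts in this file are cumulative since boot, so comparing two snapshots taken a few seconds apart gives a more reliable picture than a single read.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs-style text into {softirq_name: {cpu: count}}."""
    lines = [line for line in text.splitlines() if line.strip()]
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        counts = [int(value) for value in rest.split()]
        table[name.strip()] = dict(zip(cpus, counts))
    return table


def dominant_softirq(table, cpu):
    """Return the soft-interrupt type with the highest count on `cpu`."""
    return max(table, key=lambda name: table[name].get(cpu, 0))


# Invented two-CPU sample; on a real host read open("/proc/softirqs").read().
SAMPLE = """\
                    CPU0       CPU1
          HI:          1          0
       TIMER:     500000     480000
      NET_TX:        120         90
      NET_RX:      30000    9500000
"""

table = parse_softirqs(SAMPLE)
print(dominant_softirq(table, "CPU1"))
```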
Check the receive queue count via /sys/class/net/<iface>/queues/. A single queue forces all receive interrupts onto one CPU.
Solution: increase the number of receive queues so that interrupt handling can be distributed across multiple CPUs. For virtual machines, add a multiqueue setting in the KVM XML configuration; for bare‑metal hosts, replace the NIC with a model that supports multiple queues.
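The receive‑queue check described above can be scripted by counting the `rx-` entries in an interface's queues directory. This is a hedged Python sketch: `count_rx_queues` is a hypothetical helper, and the demo runs against a temporary directory that imitates the layout rather than a real interface, so it works on any machine.

```python
import os
import tempfile


def count_rx_queues(queues_dir):
    """Count entries named rx-<n> in a /sys/class/net/<iface>/queues/ style directory."""
    return sum(1 for name in os.listdir(queues_dir) if name.startswith("rx-"))


# Imitate a single-queue interface: one rx-0 and one tx-0 entry.
with tempfile.TemporaryDirectory() as fake_queues:
    for name in ("rx-0", "tx-0"):
        os.mkdir(os.path.join(fake_queues, name))
    single_queue = count_rx_queues(fake_queues)

print(single_queue)
```

On a real host the argument would be the queues directory of the interface under investigation. A result of 1 matches the single‑queue situation described in this section.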
5. Data‑collection considerations
Monitoring tools rarely cover every counter. For Linux, a common stack is Prometheus + Grafana + node_exporter, which captures most system metrics but may miss network‑queue or memory‑error counters. Before relying on a tool, compare the counters required by the decision tree with those exposed by the tool; supplement missing data with direct commands such as top, vmstat, mpstat, cat /proc/softirqs, and cat /sys/class/net/.../queues.
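The comparison between the decision tree and a monitoring tool amounts to a set difference. In the minimal Python sketch below, the function name `missing_counters` and all counter labels are hypothetical placeholders; they are not real node_exporter metric names.

```python
def missing_counters(required, exposed):
    """Return the counters the decision tree needs but the tool does not expose."""
    return sorted(set(required) - set(exposed))


# Placeholder labels for illustration only.
required = {"cpu.si", "cpu.load_average", "net.rx_queue_count"}
exposed_by_tool = {"cpu.si", "cpu.load_average"}

gaps = missing_counters(required, exposed_by_tool)
print(gaps)
```

Each entry in the result is a counter that would need to be collected with a direct command instead.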
6. RESAR seven‑step summary
Construct the performance‑analysis decision tree (Component → Module → Counter).
Collect real‑time values for all counters in the tree.
Perform global monitoring to spot the first‑level symptom.
Conduct targeted monitoring to drill down to the responsible module.
Record each logical step in an evidence chain.
Iterate to deeper levels if the root cause remains unclear.
Apply remediation based on the evidence chain (e.g., increase network receive queues).
Following these steps ensures a systematic, traceable, and repeatable performance‑analysis process.
JavaEdge
First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.
