Process Hotspot Tracing and Performance Analysis with Sysom
This article explains the concept of process hotspot tracing, analyzes common performance pain points in cloud‑native environments, and details Sysom's solution—including stack unwinding, symbol resolution, flame‑graph generation, and real‑world case studies—to help developers and operators quickly locate and resolve system bottlenecks.
1. Background
A process hotspot is a part of a process (such as a function, code segment, or thread) that consumes a large amount of system resources (CPU time, memory, disk I/O, etc.) or executes at a very high frequency, becoming a performance bottleneck. Identifying hotspots is crucial for performance analysis and optimization.
Process hotspot tracing uses profiling tools and visualizations (e.g., flame graphs) to quickly locate performance bottlenecks and resource‑consuming areas, providing strong support for optimization and fault diagnosis in complex modern systems.
2. Business Pain‑Point Analysis
2.1 Pain Point One: Process Performance Bottlenecks Causing Business Anomalies
In cloud‑native and containerized deployments, performance bottlenecks can significantly increase response times under high concurrency, consume excessive CPU or memory, and even lead to service unavailability. Process hotspot tracing generates call graphs and hotspot analyses to help developers and operators quickly identify these bottlenecks.
2.2 Pain Point Two: Intermittent Jitters with Hard‑to‑Trace Root Causes
When an issue occurs, manually running perf may miss the optimal diagnostic window. Sysom’s continuous hotspot collection consumes minimal resources, allowing operators to view historical process states and pinpoint root causes without manual intervention.
2.3 Pain Point Three: Long Diagnosis Cycles Leaving Latent Issues
Without effective analysis tools, teams often apply temporary “stop‑gap” fixes without addressing the root cause, leaving systems unstable. Sysom integrates traditional methods (call‑chain analysis, diff analysis) with AI‑driven diagnostics that leverage historical data to quickly locate root causes and provide clear remediation suggestions.
3. Solution: OS Console Diagnosis
Sysom generates process hotspots through three main steps:
1. Stack unwind: capture detailed kernel‑mode and user‑mode call stacks, which initially contain only raw addresses.
2. Symbol resolution: translate kernel and user addresses into human‑readable function names.
3. Flame‑graph generation: visualize the resolved call‑stack data as flame graphs.
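To make the third step concrete, here is a minimal sketch (with made‑up sample data, not Sysom's actual output format) of how sampled call stacks are typically aggregated into the "folded" form that flame‑graph renderers consume:

```python
from collections import Counter

# Hypothetical sampled call stacks (root first, leaf last), as a
# profiler might emit them after symbol resolution.
samples = [
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "write_log"),
    ("main", "idle"),
]

# Flame-graph generation starts from "folded" stacks: one line per
# unique stack, frames joined by semicolons, followed by a sample count.
folded = Counter(";".join(stack) for stack in samples)
for stack, count in sorted(folded.items()):
    print(f"{stack} {count}")
```

Stacks that account for more samples become wider frames in the rendered flame graph, which is why hotspots stand out visually.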
Below is a comparison of stack‑unwind solutions:
| Stack‑Unwind Scheme | No FP | Interpreted Languages | Kernel Version Limit | Stability | Resource Overhead |
| --- | --- | --- | --- | --- | --- |
| perf | Supported, but higher overhead | Not supported | None | High | Medium |
| eBPF | Supported via programmable eBPF | Supported | >= 4.19 | High | Low |
| Language‑level interface | – | Supported | None | Medium | Low |
From the table, eBPF offers the best balance of flexibility and low overhead, while perf provides broad kernel compatibility, and language‑level interfaces (e.g., async‑profiler) give low‑overhead sampling for specific runtimes.
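The "No FP" column refers to unwinding without frame pointers; when binaries do keep frame pointers, both perf and eBPF can walk the saved‑FP chain cheaply. The sketch below illustrates that walk with simulated memory (the addresses and layout are invented for illustration, following the x86‑64 convention where each frame stores the caller's frame pointer and a return address):

```python
# Simulated stack memory: frame pointer -> (saved caller FP, return address).
# A saved FP of 0 marks the outermost frame.
STACK = {
    0x7F00: (0x7F40, 0x401210),  # leaf frame
    0x7F40: (0x7F80, 0x401150),  # middle frame
    0x7F80: (0x0,    0x401020),  # outermost frame
}

def unwind(fp):
    """Walk the saved-FP chain, collecting return addresses leaf-first."""
    addrs = []
    while fp in STACK:
        next_fp, ret = STACK[fp]
        addrs.append(ret)
        if next_fp == 0:
            break
        fp = next_fp
    return addrs

print([hex(a) for a in unwind(0x7F00)])
```

Without frame pointers, an unwinder must instead consult DWARF/ORC unwind tables or copy stack memory for offline processing, which is where perf's higher overhead in the "No FP" case comes from.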
Symbol‑resolution strategies are summarized below:
| Scheme | Deployment Dependency | Memory Usage | Symbol Accuracy |
| --- | --- | --- | --- |
| Local resolution | Low | High | Medium |
| Remote resolution | High (network required) | Low | High |
Local resolution caches symbols on the host, consuming more memory but requiring fewer dependencies. Remote resolution fetches symbols from repositories (e.g., yum) on demand, reducing memory usage and improving accuracy when debug info is missing.
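Whichever scheme is used, the core lookup is the same: map a sampled address to the nearest preceding symbol start in a sorted symbol table (kernel symbols, for instance, are listed this way in /proc/kallsyms). A minimal sketch with a hypothetical three‑entry table:

```python
import bisect

# Hypothetical symbol table: (start_address, name), sorted by address.
symtab = [
    (0x401000, "main"),
    (0x401100, "handle_request"),
    (0x401200, "parse_json"),
]
starts = [addr for addr, _ in symtab]

def resolve(addr):
    """Return 'name+0xoffset' for the symbol containing addr."""
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return "[unknown]"
    base, name = symtab[i]
    return f"{name}+0x{addr - base:x}"

print(resolve(0x401150))  # falls inside handle_request
```

Accuracy therefore depends entirely on how complete the symbol table is, which is why remote resolution that can fetch missing debug info scores higher in the table above.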
3.2 Overall Architecture
The system consists of three layers: Sysom Front‑end, Sysom Agent, and Coolbpf Profiler (the eBPF‑based data collector). Sysom Front‑end provides visual analysis; the Agent orchestrates data collection and transmission; Coolbpf performs kernel‑mode stack capture and user‑mode symbol parsing.
3.3 Front‑End Display
Hotspot analysis allows users to select instance ID, process name, hotspot type, and time range, then render flame graphs for On‑CPU, Off‑CPU, memory, and lock hotspots. Hotspot comparison lets users select two instances to generate diff flame graphs, highlighting performance differences.
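The comparison view boils down to subtracting two folded‑stack profiles. A minimal sketch (with invented sample counts, not Sysom's actual data model) of how a diff flame graph identifies regressed stacks:

```python
from collections import Counter

# Hypothetical folded-stack profiles from two instances.
before = Counter({"main;handle_request;parse_json": 120, "main;idle": 40})
after = Counter({"main;handle_request;parse_json": 300, "main;idle": 35})

# Per-stack delta in sample counts; positive deltas are regressions
# that a diff flame graph would highlight (typically in red).
diff = {stack: after.get(stack, 0) - before.get(stack, 0)
        for stack in before.keys() | after.keys()}
regressions = {stack: d for stack, d in diff.items() if d > 0}
print(regressions)
```

Rendering the deltas over the original stack hierarchy is what makes the regressed code paths stand out immediately.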
3.4 Case Studies
Case 1 – High Load
A periodic load spike correlated with a CPU hotspot surge around 14:15. Flame‑graph analysis revealed the function native_queued_spin_lock_slowpath as the top hotspot, indicating lock contention. Further investigation traced the contention to tasklist_lock (both read and write locks) and identified the lock holder via call‑chain inspection.
Case 2 – Network Timeout
Intermittent network timeouts were traced to the function nft_do_chain. The hotspot corresponded to an overloaded netfilter rule set (over 12,000 rules), suggesting that excessive firewall rules slowed packet processing.
Case 3 – High Process CPU Usage
A shell script exhibited high CPU consumption. Flame‑graph analysis showed the hotspot at shell_execve, indicating the script was stuck in a loop repeatedly invoking ps and awk. Using strace, the repeated system calls were confirmed, allowing developers to locate and fix the problematic loop in the script source.
For any questions or feedback, scan the QR code or join the DingTalk group (ID: 94405014449) to discuss further.
Alibaba Cloud Infrastructure