
Process Hotspot Tracing and Performance Analysis with Sysom

This article explains the concept of process hotspot tracing, analyzes common performance pain points in cloud‑native environments, and details Sysom's solution—including stack unwinding, symbol resolution, flame‑graph generation, and real‑world case studies—to help developers and operators quickly locate and resolve system bottlenecks.

Alibaba Cloud Infrastructure

1. Background

A process hotspot is a part of a process (such as a function, code segment, or thread) that consumes a large amount of system resources (CPU time, memory, disk I/O, etc.) or executes at a very high frequency, becoming a performance bottleneck. Identifying hotspots is crucial for performance analysis and optimization.

Process hotspot tracing uses profiling tools and visualizations (e.g., flame graphs) to quickly locate performance bottlenecks and resource‑consuming areas, providing strong support for optimization and fault diagnosis in complex modern systems.

2. Business Pain‑Point Analysis

2.1 Pain Point One: Process Performance Bottlenecks Causing Business Anomalies

In cloud‑native and containerized deployments, performance bottlenecks can significantly increase response times under high concurrency, consume excessive CPU or memory, and even lead to service unavailability. Process hotspot tracing generates call graphs and hotspot analyses to help developers and operators quickly identify these bottlenecks.

2.2 Pain Point Two: Intermittent Jitters with Hard‑to‑Trace Root Causes

When an issue occurs, manually running perf may miss the optimal diagnostic window. Sysom’s continuous hotspot collection consumes minimal resources, allowing operators to view historical process states and pinpoint root causes without manual intervention.
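The retention idea behind continuous collection can be sketched in a few lines of Python. This is a minimal illustration, not Sysom's implementation: the `ProfileHistory` class, the window size, and the snapshot contents are all assumptions made for the example.

```python
import time
from collections import deque

class ProfileHistory:
    """Minimal sketch of continuous-profiling retention: keep a bounded
    history of periodic snapshots so past jitter can be examined later.
    Hypothetical design; not Sysom's actual storage format."""

    def __init__(self, max_snapshots=1440):   # e.g. one per minute for 24h
        self._buf = deque(maxlen=max_snapshots)

    def record(self, folded_profile, ts=None):
        # Store one snapshot (folded stack -> sample count) with a timestamp.
        self._buf.append((ts if ts is not None else time.time(), folded_profile))

    def between(self, start, end):
        # Return snapshots inside [start, end] for retrospective analysis.
        return [(t, p) for t, p in self._buf if start <= t <= end]

hist = ProfileHistory(max_snapshots=3)
for t in (1, 2, 3, 4):                        # the oldest snapshot is evicted
    hist.record({"main;worker": t}, ts=t)
```

Because the buffer is bounded, the collector's memory footprint stays fixed no matter how long it runs, which is what makes always-on collection cheap enough to leave enabled.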

2.3 Pain Point Three: Long Diagnosis Cycles Leaving Latent Issues

Without effective analysis tools, teams often apply temporary “stop‑gap” fixes without addressing the root cause, leaving systems unstable. Sysom integrates traditional methods (call‑chain analysis, diff analysis) with AI‑driven diagnostics that leverage historical data to quickly locate root causes and provide clear remediation suggestions.

3. Solution: OS Console Diagnosis

3.1 Key Technologies

Sysom generates process hotspots through three main steps:

1. Stack Unwinding: capture detailed kernel‑mode and user‑mode call stacks, which initially contain only raw addresses.

2. Symbol Resolution: translate kernel and user addresses into human‑readable function names.

3. Flame‑Graph Generation: visualize the call‑stack data as flame graphs.
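The three steps above can be sketched end to end in plain Python. This is an illustrative toy, not Sysom's code: the `SYMBOLS` table and its addresses are invented, and in practice symbols come from sources such as /proc/kallsyms (kernel) or ELF symbol tables (user space). The output is the "folded" stack format commonly consumed by flame‑graph tools.

```python
from collections import Counter

# Hypothetical symbol table: start address -> function name.
SYMBOLS = {0x1000: "main", 0x2000: "handle_request", 0x3000: "parse_json"}

def resolve(addr, symbols=SYMBOLS):
    """Step 2: translate a raw address into a function name,
    falling back to the hex address when no symbol is known."""
    return symbols.get(addr, f"0x{addr:x}")

def fold(samples):
    """Steps 1+3: collapse captured stacks (root-first address lists)
    into folded 'frame;frame;frame count' lines for flame-graph tools."""
    counts = Counter(";".join(resolve(a) for a in stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

samples = [
    [0x1000, 0x2000, 0x3000],
    [0x1000, 0x2000, 0x3000],
    [0x1000, 0x2000],
]
print(fold(samples))
```

Each folded line is one unique call path with its sample count; a flame graph is essentially a visualization of this aggregated mapping.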

Below is a comparison of stack‑unwind solutions:

| Stack‑Unwind Scheme | No FP | Interpreted Language | Kernel Version Limit | Stability | Resource Overhead |
| --- | --- | --- | --- | --- | --- |
| perf | Supported but higher overhead | Not supported | None | High | Medium |
| eBPF | Supported via programmable eBPF | Supported | >= 4.19 | High | Low |
| Language‑Level Interface | — | Supported | None | Medium | Low |
From the table, eBPF offers the best balance of flexibility and low overhead, while perf provides broad kernel compatibility, and language‑level interfaces (e.g., async‑profiler) give low‑overhead sampling for specific runtimes.

Symbol‑resolution strategies are summarized below:

| Scheme | Deployment Dependency | Memory Usage | Symbol Accuracy |
| --- | --- | --- | --- |
| Local Resolution | Low | High | Medium |
| Remote Resolution | High (network required) | Low | High |
Local resolution caches symbols on the host, consuming more memory but requiring fewer dependencies. Remote resolution fetches symbols from repositories (e.g., yum) on demand, reducing memory usage and improving accuracy when debug info is missing.
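Local resolution typically boils down to a range lookup over a sorted symbol table: find the symbol with the greatest start address not exceeding the sampled address. Here is a minimal sketch of that lookup, assuming a kallsyms‑style list of (start address, name) pairs; the addresses and names are hypothetical.

```python
import bisect

# Hypothetical kallsyms-style table: (start_address, symbol_name),
# kept sorted by start address for binary search.
SYMTAB = sorted([
    (0xffffffff81000000, "_stext"),
    (0xffffffff81020000, "do_sys_open"),
    (0xffffffff81031000, "vfs_read"),
])
STARTS = [start for start, _ in SYMTAB]

def resolve_local(addr):
    """Map an address to 'symbol+0xoffset' using the entry with the
    greatest start address <= addr; fall back to hex when unresolved."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        return hex(addr)          # below the lowest known symbol
    start, name = SYMTAB[i]
    return f"{name}+0x{addr - start:x}"

print(resolve_local(0xffffffff81020010))   # falls inside do_sys_open
```

The memory/accuracy trade‑off in the table above is visible here: a local resolver must hold tables like `SYMTAB` for every profiled binary in memory, whereas a remote resolver can fetch exact debug symbols on demand.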

3.2 Overall Architecture

The system consists of three layers: Sysom Front‑end, Sysom Agent, and Coolbpf Profiler (the eBPF‑based data collector). Sysom Front‑end provides visual analysis; the Agent orchestrates data collection and transmission; Coolbpf performs kernel‑mode stack capture and user‑mode symbol parsing.

3.3 Front‑End Display

Hotspot analysis allows users to select instance ID, process name, hotspot type, and time range, then render flame graphs for On‑CPU, Off‑CPU, memory, and lock hotspots. Hotspot comparison lets users select two instances to generate diff flame graphs, highlighting performance differences.
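The core of a diff flame graph is a per‑stack subtraction between two aggregated profiles. The sketch below shows that computation on folded profiles (stack string → sample count); the function name and sample data are invented for illustration and do not reflect Sysom's internals.

```python
def diff_profiles(base, target):
    """Per-stack sample deltas between two folded profiles.
    Positive delta: the stack grew hotter in `target`; negative: cooler."""
    stacks = set(base) | set(target)
    return {s: target.get(s, 0) - base.get(s, 0) for s in stacks}

# Two hypothetical instances of the same service:
base   = {"main;read_file": 90, "main;parse": 10}
target = {"main;read_file": 40, "main;parse": 55, "main;gc": 5}
delta = diff_profiles(base, target)
```

Rendering `delta` as a flame graph (e.g., red for positive, blue for negative deltas) is what lets users see at a glance which call paths diverge between the two selected instances.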

3.4 Case Studies

Case 1 – High Load

A periodic load spike was correlated with a CPU hotspot surge around 14:15. Flame‑graph analysis revealed the function native_queued_spin_lock_slowpath as the top hotspot, indicating lock contention. Further investigation traced the contention to tasklist_lock (both read and write locks) and identified the lock holder via call‑chain inspection.

Case 2 – Network Timeout

Intermittent network timeouts were traced to the function nft_do_chain. The hotspot corresponded to an overloaded netfilter rule set (over 12,000 rules), suggesting that excessive firewall rules slowed packet processing.

Case 3 – High Process CPU Usage

A shell script exhibited high CPU consumption. Flame‑graph analysis showed the hotspot at shell_execve, indicating the script was stuck in a loop repeatedly invoking ps and awk. Using strace, the repeated system calls were confirmed, allowing developers to locate and fix the problematic loop in the script source.

For any questions or feedback, scan the QR code or join the DingTalk group (ID: 94405014449) to discuss further.

Written by Alibaba Cloud Infrastructure