
Efficiently Resolving Performance Bottlenecks and Jitter with Process Hotspot Tracing in Alibaba Cloud OS Console

The article explains how Alibaba Cloud's SysOM console uses low‑overhead process hotspot tracing, stack unwinding, symbol resolution, eBPF and AI diagnostics to pinpoint CPU, memory, lock and network issues, offering visual flame‑graph analysis and real‑world case studies for faster root‑cause identification.

Linux Kernel Journey

Process Hotspot Definition

A process hotspot is a part of a process (a function, code block, or thread) that consumes a disproportionate share of CPU time, memory, or I/O, or that executes at very high frequency, and so becomes a performance bottleneck.

Technical Challenges

High-frequency, intermittent jitter is easily missed by manually invoked tools such as perf, because the event is often over before a capture can be started.

Many binaries are compiled without a frame pointer (FP), preventing traditional FP‑based stack unwinding.

Interpreted languages (Java, Python, etc.) have custom stack‑frame layouts that ordinary profilers cannot decode.
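To make the frame-pointer limitation concrete, the following is a minimal simulation of FP-based unwinding, assuming the usual x86-64 layout in which the saved caller FP sits at the address in FP and the return address sits 8 bytes above it. The memory dictionary and all addresses are made up for illustration and do not come from SysOM.

```python
def unwind_with_frame_pointers(memory, fp, max_depth=64):
    """Walk the saved-FP chain and collect return addresses, innermost first.

    memory maps an address to the 8-byte word stored there (simulated stack).
    """
    return_addresses = []
    while fp != 0 and len(return_addresses) < max_depth:
        if fp not in memory or fp + 8 not in memory:
            break  # chain is broken, e.g. the caller was built without FP
        return_addresses.append(memory[fp + 8])  # return address of this frame
        fp = memory[fp]                          # follow the saved caller FP
    return return_addresses


# Hypothetical stack contents for a binary built with frame pointers.
memory = {
    0x7000: 0x7020, 0x7008: 0x401136,  # innermost frame
    0x7020: 0x7040, 0x7028: 0x401180,
    0x7040: 0x0,    0x7048: 0x4011F0,  # outermost frame, FP chain ends at 0
}
frames = unwind_with_frame_pointers(memory, 0x7000)

# When the register holds an arbitrary value instead of a frame address,
# the walk stops after the first frame and the stack is truncated.
truncated = unwind_with_frame_pointers({0x7000: 0x1234, 0x7008: 0x401136}, 0x7000)
```

The truncated case is the failure mode the challenge above describes: without a valid FP chain, the walker cannot find the caller's frame.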

Solution Overview

SysOM implements continuous, low‑overhead hotspot tracing in three stages: stack unwinding, symbol resolution, and flame‑graph generation.
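The last stage can be illustrated with the folded-stack text format that flame-graph tools commonly consume: one line per unique stack, frames joined by semicolons, followed by a sample count. This is a sketch under that assumption; the article does not describe SysOM's internal data format, and the function names and sample stacks here are hypothetical.

```python
from collections import Counter


def fold_stacks(samples):
    """Count identical stacks; each sample is a list of frames, outermost first."""
    return Counter(";".join(stack) for stack in samples)


def to_folded_lines(counts):
    """Render counts as 'frame1;frame2;... N' lines, sorted for stable output."""
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical, already-symbolized samples.
samples = [
    ["main", "parse", "read"],
    ["main", "parse", "read"],
    ["main", "render"],
]
folded = to_folded_lines(fold_stacks(samples))
```

Each line's count becomes the width of the corresponding tower in the flame graph, which is why stacks that recur in many samples stand out visually.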

Stack Unwinding

Three approaches are compared:

perf – works on all kernel versions, incurs medium overhead, cannot unwind interpreted languages.

eBPF – programmable, supports no‑FP binaries and interpreted languages, requires kernel ≥ 4.19, offers high stability and low overhead.

Language‑level interfaces (e.g., JVM TI) – low overhead, no kernel constraints, but intrusive and may affect stability.

Symbol Resolution

Two strategies are provided:

Local resolution – minimal deployment dependencies, high memory usage for symbol caching, moderate accuracy (depends on presence of debuginfo on the host).

Remote resolution – fetches debuginfo from package repositories, low memory footprint, high accuracy.
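Whichever strategy supplies the symbol data, the core lookup is the same: map a sampled address to the function whose address range contains it. A minimal sketch, assuming the symbols are available as (start address, size, name) tuples; the class name and the addresses are hypothetical, and real resolvers also have to account for load offsets and inlined functions.

```python
import bisect


class SymbolTable:
    """Sorted address-range table with binary-search lookup."""

    def __init__(self, symbols):
        # symbols: iterable of (start_address, size, name)
        self._symbols = sorted(symbols)
        self._starts = [start for start, _, _ in self._symbols]

    def resolve(self, address):
        """Return 'name+0xoffset' or None if no symbol covers the address."""
        i = bisect.bisect_right(self._starts, address) - 1
        if i < 0:
            return None  # address is below the first symbol
        start, size, name = self._symbols[i]
        if address >= start + size:
            return None  # address falls in a gap between symbols
        return f"{name}+0x{address - start:x}"


table = SymbolTable([(0x401100, 0x40, "parse"), (0x401000, 0x80, "main")])
resolved = table.resolve(0x401136)
```

Caching such tables per binary is what drives the memory cost noted for local resolution; a remote service can hold them once for many hosts.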

Architecture

SysOM Frontend – visual UI for hotspot analysis, hotspot comparison, and CPU/GPU heatmaps.

SysOM Agent – collects data via the Coolbpf profiler and forwards results; includes modules for OnCpu, OffCpu, Memory, and Lock hotspots.

Coolbpf Profiler – eBPF‑based library that captures kernel‑ and user‑space stacks and performs user‑mode symbol resolution for compiled and interpreted languages.

Frontend Workflow

Select instance ID, process name, hotspot type (OnCpu, OffCpu, Memory, Lock), and time range; click “Execute Hotspot Trace”.

For OnCpu, a flame‑graph of CPU hotspots is rendered.

For Memory, a memory‑usage flame‑graph is rendered.

Hotspot comparison requires two instances and produces a differential flame‑graph that highlights changes between normal and abnormal periods.
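One way to think about the differential view is as a per-stack change in sample share between the two captures. The sketch below assumes both captures are available as folded-stack counts (stack string to sample count); how SysOM actually computes and colours the difference is not described in the article, and the stacks are hypothetical.

```python
def diff_folded(baseline, abnormal):
    """Return (stack, delta) pairs, where delta is the change in sample share
    (abnormal minus baseline), largest increase first."""
    base_total = sum(baseline.values()) or 1  # avoid division by zero
    abn_total = sum(abnormal.values()) or 1
    deltas = {}
    for stack in set(baseline) | set(abnormal):
        deltas[stack] = (abnormal.get(stack, 0) / abn_total
                         - baseline.get(stack, 0) / base_total)
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical captures from a normal and an abnormal period.
baseline = {"main;work": 90, "main;lock": 10}
abnormal = {"main;work": 40, "main;lock": 60}
changes = diff_folded(baseline, abnormal)
```

Normalising by the total sample count keeps the comparison meaningful when the two captures cover different durations or sampling rates.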

Case Studies

Case 1 – Load Spike (>10× normal)

The CPU flame‑graph showed a sudden increase in native_queued_spin_lock_slowpath, indicating lock contention. Drill‑down revealed repeated waits on tasklist_lock. Call‑stack analysis identified the kernel lock holder, confirming that the lock was the root cause of the load spike.
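A "sudden increase" of one function in a flame graph can be quantified as the share of samples whose stack contains that function. The sketch below assumes folded-stack counts as input; the caller names and counts are placeholders, not data from the case.

```python
def share_containing(folded_counts, function):
    """Fraction of samples whose stack includes the given function."""
    total = sum(folded_counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for stack, n in folded_counts.items()
               if function in stack.split(";"))
    return hits / total


# Placeholder stacks; only the spin-lock function name is from the case.
samples = {
    "entry;caller_a;native_queued_spin_lock_slowpath": 70,
    "entry;caller_b;native_queued_spin_lock_slowpath": 10,
    "entry;caller_c": 20,
}
share = share_containing(samples, "native_queued_spin_lock_slowpath")
```

Grouping the matching stacks by the frame just below the spin-lock function is the drill-down step that shows which callers are waiting on the lock.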

Case 2 – Intermittent Network Timeout

Hotspot tracing highlighted nft_do_chain, the netfilter rule‑processing function. Inspection of the netfilter rule set revealed >12,000 rules, confirming that rule‑processing overhead caused the timeout.
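Why a large rule set costs time per packet can be shown with a deliberately simplified model: a chain evaluated rule by rule until the first match. This is an illustration only, not how nft_do_chain is implemented, and the rule contents are invented.

```python
def evaluate_chain(rules, packet):
    """First-match linear evaluation; returns (verdict, rules_checked)."""
    for checked, (predicate, verdict) in enumerate(rules, start=1):
        if predicate(packet):
            return verdict, checked
    return "continue", len(rules)


# 12,000 hypothetical single-port rules; port=i binds each rule's own port.
rules = [(lambda p, port=i: p["dport"] == port, "accept") for i in range(12000)]

# A packet that only the last rule matches has to be tested against every rule.
worst_case = evaluate_chain(rules, {"dport": 11999})
best_case = evaluate_chain(rules, {"dport": 0})
```

In this model the per-packet cost grows linearly with the position of the matching rule, which is consistent with rule processing showing up as a hotspot once the rule count reaches the thousands.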

Case 3 – High CPU Usage in a Shell Script

The flame‑graph highlighted shell_execve as the hotspot, suggesting a runaway loop. Using strace, repeated ps and awk executions were observed. Searching the script source for these commands pinpointed the offending loop, which was then removed.
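The strace step can be summarised mechanically by counting which programs are executed most often. A minimal sketch, assuming the trace was saved as text with one execve call per line in strace's usual format; the sample lines and PIDs below are made up.

```python
import os
import re
from collections import Counter

# Captures the path argument of execve("path", [...], ...).
EXECVE_RE = re.compile(r'execve\("([^"]+)"')


def count_execs(strace_lines):
    """Count executed programs by basename across strace output lines."""
    counts = Counter()
    for line in strace_lines:
        match = EXECVE_RE.search(line)
        if match:
            counts[os.path.basename(match.group(1))] += 1
    return counts


# Hypothetical excerpt of strace output for a looping script.
lines = [
    '[pid 4101] execve("/usr/bin/ps", ["ps", "-ef"], 0x55d0 /* 23 vars */) = 0',
    '[pid 4102] execve("/usr/bin/awk", ["awk", "{print $2}"], 0x55d0 /* 23 vars */) = 0',
    '[pid 4103] execve("/usr/bin/ps", ["ps", "-ef"], 0x55d0 /* 23 vars */) = 0',
    '[pid 4103] openat(AT_FDCWD, "/proc/stat", O_RDONLY) = 3',
]
exec_counts = count_execs(lines)
```

The most frequent names from `exec_counts.most_common()` are the strings worth searching for in the script source.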

Key Technical Details

eBPF stack unwinding works without FP and supports Java/Python stack frames via custom parsers.

Kernel version requirement for eBPF‑based tracing: ≥ 4.19.

perf incurs medium resource overhead and cannot unwind interpreted languages.

Language‑level sampling tools (e.g., async‑profiler) have low overhead but are intrusive and may cause rare crashes.

SysOM prioritises eBPF as the primary stack‑unwinding method, falls back to perf, and uses language‑level interfaces as a last resort.
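The priority order above, combined with the kernel 4.19 requirement, can be expressed as a small selection function. This is a sketch only: the article does not say what conditions trigger each fallback, so the `perf_available` flag and the returned labels are assumptions, and the release strings are illustrative.

```python
import re


def kernel_at_least(release, minimum=(4, 19)):
    """Compare the major.minor prefix of a kernel release string to a minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


def choose_unwinder(release, perf_available=True):
    """Prefer eBPF, then perf, then language-level interfaces (assumed criteria)."""
    if kernel_at_least(release):
        return "ebpf"
    if perf_available:
        return "perf"
    return "language-level"


choice_new = choose_unwinder("5.10.134-16.al8.x86_64")
choice_old = choose_unwinder("3.10.0-1160.el7.x86_64")
```

On a live host the release string could be taken from `platform.release()` in the Python standard library.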

