How to Reveal Tracing Blind Spots with Continuous Profiling and Code Hotspots
This article explains the evolution of observability, outlines a step‑by‑step diagnosis workflow using metrics, logs and tracing, highlights the blind spots of traditional tracing, and demonstrates how Alibaba Cloud ARMS continuous profiling and code‑hotspot features can pinpoint slow call‑chain issues in Java applications.
Since Google published its seminal Dapper paper, the industry has converged on three pillars of observability: metrics, tracing, and logging. Together they form the de facto standard for diagnosing distributed systems.
Full‑stack Diagnosis Workflow
Using this stack, a typical problem‑resolution process follows three steps:
Detect anomalies via pre‑configured alerts from Metrics or Logs and identify the affected module.
Query and analyse the related logs to locate the core error message.
Leverage detailed tracing data to pinpoint the exact code segment responsible for the issue.
Beyond rapid post‑incident root‑cause analysis, a comprehensive observability solution can also surface problems before they cause major outages.
Tracing Blind Spots
Tracing relies on Java agents or SDKs that instrument popular frameworks (HTTP, RPC, databases, MQ, etc.). When the slow path lies in uninstrumented business logic, the trace shows a long span with no child span or method to account for the time, so the latency cannot be attributed to any specific code.
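The following example illustrates this gap:

    // Assumes `conn` is an open java.sql.Connection field of the enclosing class.
    public String demo() throws SQLException {
        // Simulated 1000 ms business delay (plain business code, not instrumented)
        take1000ms(1000);
        // Database query (JDBC is instrumented, so it gets its own span)
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM table")) {
            return "Hello ARMS!";
        }
    }

    private void take1000ms(long time) {
        try {
            Thread.sleep(time);
        } catch (InterruptedException e) {
            // Restore the interrupt flag instead of swallowing it
            Thread.currentThread().interrupt();
        }
    }

Tracing tools typically capture the database calls (createStatement and executeQuery) but miss the artificial delay in take1000ms, so its latency is folded into the span of the surrounding Spring Boot method.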
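To make the gap concrete, the sketch below computes how much of an entry span is not explained by its child spans. The Span class, the span names, and the timings are hypothetical and chosen to mirror the example; this is not an ARMS or tracing-SDK type, and it assumes child spans lie inside the parent and do not overlap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal span model, for illustration only.
final class SpanGap {
    static final class Span {
        final String name;
        final long startMs;
        final long endMs;

        Span(String name, long startMs, long endMs) {
            this.name = name;
            this.startMs = startMs;
            this.endMs = endMs;
        }

        long durationMs() {
            return endMs - startMs;
        }
    }

    // Time inside the parent span that no child span accounts for.
    // Assumes children are contained in the parent and do not overlap.
    static long unattributedMs(Span parent, List<Span> children) {
        long covered = 0;
        for (Span child : children) {
            covered += child.durationMs();
        }
        return parent.durationMs() - covered;
    }

    public static void main(String[] args) {
        // Assumed timings: 1000 ms of business delay, then a 30 ms query.
        Span entry = new Span("GET /demo", 0, 1030);
        List<Span> children = new ArrayList<>();
        children.add(new Span("SELECT * FROM table", 1000, 1030));
        System.out.println("Unattributed: " + unattributedMs(entry, children) + " ms");
    }
}
```

With these assumed timings the trace can explain only 30 ms of the 1030 ms request. The remaining 1000 ms is the blind spot: the trace proves the time was spent, but not where.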
Limitations of Arthas Trace Command
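Arthas, Alibaba's open-source Java diagnostic tool, provides a trace command that prints the time spent in each call inside a given method, so it can expose such hidden latency manually. In practice it has three limitations:
Limited scope: It works only for problems that can be reproduced reliably; it does not help with intermittent slowness caused by GC spikes, resource contention, or network issues.
High usage barrier: It requires deep familiarity with the codebase, because trace commands must be issued manually against specific methods.
High investigation cost: Complex multi-service call chains require repeated manual tracing across instances, which makes the process cumbersome.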
Consequently, while Arthas can help in simple cases, it falls short for intricate, multi‑hop latency problems.
ARMS Continuous Profiling (CP) Solution
Alibaba Cloud ARMS combines traditional tracing, metrics, and logging with a built-in continuous profiling capability. CP continuously samples CPU and memory stack traces via a Java Agent, aggregates them on the server, and offers three capabilities:
CPU & Memory Diagnosis: Flame graphs (based on Async Profiler) show on-CPU hotspots with low overhead (about 5% CPU and about 50 MB of off-heap memory).
Code Hotspots: By correlating TraceId and SpanId with sampled stacks, ARMS produces on-CPU and off-CPU flame graphs that reveal the exact business logic hidden from standard tracing.
Safety & Reliability: Low-overhead sampling, automatic 7-day data retention, and mitigations for known Async Profiler issues (e.g., #694, #769) make CP production-ready.
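The idea of correlating a trace identifier with sampled stacks can be sketched in plain Java. This is not how ARMS or Async Profiler is implemented: it uses Thread.getStackTrace(), which is safepoint-biased and far more expensive than a native profiler, and the TraceSampler class, its methods, and the trace ID are all hypothetical. It only shows the principle of tagging each sample with the trace that the sampled thread is serving.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of trace-correlated stack sampling.
final class TraceSampler {
    // Which trace each thread is currently serving.
    private static final Map<Thread, String> ACTIVE = new ConcurrentHashMap<>();
    // traceId -> (folded stack "root;...;leaf" -> sample count)
    private static final Map<String, Map<String, Integer>> SAMPLES = new ConcurrentHashMap<>();

    static void enter(String traceId) { ACTIVE.put(Thread.currentThread(), traceId); }

    static void exit() { ACTIVE.remove(Thread.currentThread()); }

    // Take one stack sample of every thread that is serving a trace.
    static void sampleOnce() {
        for (Map.Entry<Thread, String> e : ACTIVE.entrySet()) {
            StackTraceElement[] frames = e.getKey().getStackTrace();
            if (frames.length == 0) continue; // thread already finished
            StringBuilder folded = new StringBuilder();
            for (int i = frames.length - 1; i >= 0; i--) { // root frame first
                if (folded.length() > 0) folded.append(';');
                folded.append(frames[i].getClassName()).append('.').append(frames[i].getMethodName());
            }
            SAMPLES.computeIfAbsent(e.getValue(), k -> new ConcurrentHashMap<>())
                   .merge(folded.toString(), 1, Integer::sum);
        }
    }

    static Map<String, Integer> samplesFor(String traceId) {
        return SAMPLES.getOrDefault(traceId, Map.of());
    }

    // Run `work` on its own thread under `traceId`, sampling it every `intervalMs`.
    static Map<String, Integer> profile(String traceId, Runnable work, long intervalMs) {
        Thread worker = new Thread(() -> {
            enter(traceId);
            try { work.run(); } finally { exit(); }
        });
        worker.start();
        try {
            while (worker.isAlive()) {
                sampleOnce();
                Thread.sleep(intervalMs);
            }
            worker.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return samplesFor(traceId);
    }

    // Stand-in for uninstrumented business logic.
    static void slowBusinessLogic() {
        try {
            Thread.sleep(300);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        profile("trace-1", TraceSampler::slowBusinessLogic, 10)
            .forEach((stack, count) -> System.out.println(count + " x " + stack));
    }
}
```

Because every sample is stored under a trace ID, the stacks for a single slow request can be retrieved later and rendered as a flame graph. Most samples here end in Thread.sleep below slowBusinessLogic, which is exactly the frame a span-only trace cannot show.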
Enabling and Using Code Hotspots
Log in to the ARMS console and navigate to Application Monitoring → Application List.
Select the target region and application.
Open Application Settings → Custom Configuration.
Enable the CPU & Memory Hotspot switch, then turn on Code Hotspot and specify the IPs or CIDR of the instances to profile.
Save the configuration.
In the console, go to Interface Call → Trace Query, select a TraceId, and view the Method Stack tab (it shows only instrumented spans, e.g., MariaDB).
Switch to the Code Hotspot tab to see a flame graph that includes the previously invisible java.lang.Thread.sleep() delay (≈990 ms), confirming the missing instrumentation.
The flame graph lists each method’s self-time, so engineers can start from the widest frames to locate performance bottlenecks.
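As a rough illustration of how self-time relates to sampled stacks, the sketch below derives per-method self and total sample counts from folded stacks. The FlameStats class, the sample counts, and the 10 ms sampling interval are assumptions made for this example; they are not ARMS output or ARMS defaults.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: derives flame-graph statistics from folded stacks.
final class FlameStats {
    // Self samples: a stack's count is credited only to its leaf (last) frame.
    static Map<String, Integer> selfSamples(Map<String, Integer> folded) {
        Map<String, Integer> self = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : folded.entrySet()) {
            String stack = e.getKey();
            String leaf = stack.substring(stack.lastIndexOf(';') + 1);
            self.merge(leaf, e.getValue(), Integer::sum);
        }
        return self;
    }

    // Total samples: a stack's count is credited to every frame on it.
    // This corresponds to the width of a frame in the flame graph.
    static Map<String, Integer> totalSamples(Map<String, Integer> folded) {
        Map<String, Integer> total = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : folded.entrySet()) {
            Set<String> seen = new HashSet<>(); // count a recursive frame once per stack
            for (String frame : e.getKey().split(";")) {
                if (seen.add(frame)) {
                    total.merge(frame, e.getValue(), Integer::sum);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed samples for one trace, root frame first.
        Map<String, Integer> folded = new LinkedHashMap<>();
        folded.put("Demo.demo;Demo.take1000ms;Thread.sleep", 99);
        folded.put("Demo.demo;Statement.executeQuery", 3);

        long intervalMs = 10; // assumed sampling interval
        selfSamples(folded).forEach((method, n) ->
            System.out.println(method + " self ~" + n * intervalMs + " ms"));
        totalSamples(folded).forEach((method, n) ->
            System.out.println(method + " total ~" + n * intervalMs + " ms"));
    }
}
```

Under these assumptions, Thread.sleep has an estimated self-time of about 990 ms, while Demo.demo has the largest total (about 1020 ms) but no self-time at all. That is why reading a flame graph usually means following the widest frame downward until the self-time is found.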
Core Characteristics
Low Overhead: Automatic trace-based sampling keeps CPU impact around 5% and memory usage modest.
Fine Granularity: Correlates trace identifiers with sampled stacks to expose call-chain-level hotspots.
Secure & Reliable: Addresses known Async Profiler risks, provides 7-day data retention, and operates safely in production.
For further reading, see the Arthas trace command documentation, the Async Profiler project, and ARMS user guides.
Alibaba Cloud Native