
How to Reveal Tracing Blind Spots with Continuous Profiling and Code Hotspots

This article explains the evolution of observability, outlines a step‑by‑step diagnosis workflow using metrics, logs and tracing, highlights the blind spots of traditional tracing, and demonstrates how Alibaba Cloud ARMS continuous profiling and code‑hotspot features can pinpoint slow call‑chain issues in Java applications.

Alibaba Cloud Native

Oct 21, 2023

Since Google published its Dapper paper, the industry has converged on a three‑pillar observability stack (Metrics, Tracing, and Logging) that has become the de facto standard for diagnosing distributed systems.

Full‑stack Diagnosis Workflow

Using this stack, a typical problem‑resolution process follows three steps:

Detect anomalies via pre‑configured alerts from Metrics or Logs and identify the affected module.

Query and analyse the related logs to locate the core error message.

Leverage detailed tracing data to pinpoint the exact code segment responsible for the issue.

Beyond rapid post‑incident root‑cause analysis, a comprehensive observability solution can also surface problems before they cause major outages.

Tracing Blind Spots

Tracing relies on Java agents or SDKs that instrument popular frameworks (HTTP, RPC, databases, MQ, etc.). When the slow path lies in uninstrumented business logic, the trace shows a long span with no child span or method to explain it, so the latency cannot be attributed to a specific piece of code.

Example code illustrates this gap:

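// Requires: import java.sql.*;  conn is assumed to be an open java.sql.Connection field
public String demo() throws SQLException {
    // Simulated 1000 ms business delay (plain business code, not instrumented)
    take1000ms(1000);
    // Database query (instrumented by the tracing agent); resources are closed automatically
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT * FROM table")) {
        return "Hello ARMS!";
    }
}

private void take1000ms(long time) {
    try {
        Thread.sleep(time);
    } catch (InterruptedException e) {
        // Restore the interrupt flag so callers can still observe the interruption
        Thread.currentThread().interrupt();
    }
}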

Tracing tools typically capture the database calls (createStatement and executeQuery) but miss the artificial delay in take1000ms, so its latency is folded into the span of the surrounding Spring Boot method.
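One way to quantify this gap is to subtract the time covered by child spans from the duration of the parent span. The sketch below is illustrative only: the Span class, the span names, and the timings are assumptions rather than ARMS data structures, and it assumes that child spans run sequentially without overlapping.

```java
import java.util.Arrays;
import java.util.List;

public class SpanGap {
    /** Minimal span model (hypothetical): a name plus start and end timestamps in milliseconds. */
    public static final class Span {
        final String name;
        final long startMs;
        final long endMs;

        public Span(String name, long startMs, long endMs) {
            this.name = name;
            this.startMs = startMs;
            this.endMs = endMs;
        }

        long durationMs() {
            return endMs - startMs;
        }
    }

    /**
     * Time inside the parent span that no child span accounts for.
     * Assumes the children are sequential and do not overlap.
     */
    public static long unattributedMs(Span parent, List<Span> children) {
        long covered = 0;
        for (Span child : children) {
            covered += child.durationMs();
        }
        return parent.durationMs() - covered;
    }

    public static void main(String[] args) {
        // Illustrative numbers: a 1010 ms request containing a single 10 ms JDBC span.
        Span parent = new Span("GET /demo", 0, 1010);
        List<Span> children = Arrays.asList(new Span("JDBC query", 1000, 1010));
        System.out.println("Unattributed ms: " + unattributedMs(parent, children));
    }
}
```

With these made-up numbers, 1000 ms of the request is unattributed. The trace tells you that the time exists but not which method spent it.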

Limitations of Arthas Trace Command

A common workaround is the trace command of Arthas, an open‑source Java diagnostic tool, which reports the time spent in each call made inside a specified method. It has three main limitations:

Scope limited: Works only for reproducible, stable scenarios; cannot handle GC spikes, resource contention, or network issues.

High usage barrier: Requires deep familiarity with the codebase to manually issue trace commands on specific methods.

High investigation cost: Complex multi‑service call chains demand repeated manual tracing across instances, making the process cumbersome.

Consequently, while Arthas can help in simple cases, it falls short for intricate, multi‑hop latency problems.

ARMS Continuous Profiling (CP) Solution

Alibaba Cloud ARMS combines traditional tracing, metrics, and logging with a built‑in continuous profiling capability. CP continuously samples CPU and memory stack traces via a Java Agent, aggregates them on the server, and offers the following capabilities:

CPU & Memory Diagnosis: Flame graphs (based on Async Profiler) show on‑CPU hotspots with low overhead (≈5% CPU, ~50 MB off‑heap).

Code Hotspots: By correlating TraceId and SpanId with sampled stacks, ARMS produces on‑ and off‑CPU flame graphs that reveal the exact business logic hidden from standard tracing.

Safety & Reliability: Low‑overhead sampling, automatic 7‑day data retention, and mitigations for known Async Profiler issues (e.g., #694, #769) make CP production‑ready.
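The idea of tagging sampled stacks with trace identifiers can be sketched in plain Java. This is not how ARMS or Async Profiler is implemented: the sketch swaps in Thread.getStackTrace(), which is safepoint‑biased and costlier than Async Profiler's sampling, and the ACTIVE_SPANS registry, the "trace-1/span-1" identifier, and all method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

public class TraceAwareSampler {
    // Hypothetical registry: which trace/span each thread is currently serving.
    static final Map<Thread, String> ACTIVE_SPANS = new ConcurrentHashMap<>();

    /** Stands in for uninstrumented business logic that tracing cannot see. */
    static void slowBusinessLogic() throws InterruptedException {
        Thread.sleep(300);
    }

    /** Samples one thread until it exits; returns spanKey -> (collapsed stack -> sample count). */
    static Map<String, Map<String, Integer>> sample(Thread target, long intervalMs)
            throws InterruptedException {
        Map<String, Map<String, Integer>> bySpan = new TreeMap<>();
        while (target.isAlive()) {
            String spanKey = ACTIVE_SPANS.get(target);
            StackTraceElement[] frames = target.getStackTrace();
            if (spanKey != null && frames.length > 0) {
                // Collapse the stack root-first ("root;caller;leaf").
                StringBuilder stack = new StringBuilder();
                for (int i = frames.length - 1; i >= 0; i--) {
                    stack.append(frames[i].getClassName()).append('.').append(frames[i].getMethodName());
                    if (i > 0) {
                        stack.append(';');
                    }
                }
                bySpan.computeIfAbsent(spanKey, k -> new TreeMap<>())
                      .merge(stack.toString(), 1, Integer::sum);
            }
            Thread.sleep(intervalMs);
        }
        return bySpan;
    }

    /** Runs a worker under a made-up trace/span id and samples it every 10 ms. */
    public static Map<String, Map<String, Integer>> runDemo() {
        Thread worker = new Thread(() -> {
            ACTIVE_SPANS.put(Thread.currentThread(), "trace-1/span-1"); // illustrative ids
            try {
                slowBusinessLogic();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                ACTIVE_SPANS.remove(Thread.currentThread());
            }
        }, "worker");
        worker.start();
        try {
            Map<String, Map<String, Integer>> result = sample(worker, 10);
            worker.join();
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("sampling interrupted", e);
        }
    }

    public static void main(String[] args) {
        runDemo().forEach((span, stacks) ->
            stacks.forEach((stack, count) -> System.out.println(span + " " + stack + " " + count)));
    }
}
```

Because every sample is keyed by span, the stacks containing slowBusinessLogic can be retrieved for a single slow request instead of being averaged over the whole process.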

Enabling and Using Code Hotspots

Select the target region and application.

Open Application Settings → Custom Configuration.

Enable the CPU & Memory Hotspot switch, then turn on Code Hotspot and specify the IPs or CIDR of the instances to profile.

Save the configuration.

In the console, go to Interface Call → Trace Query, select a TraceId, and view the Method Stack tab (shows only instrumented spans, e.g., MariaDB).

Switch to the Code Hotspot tab to see a flame graph that includes the previously invisible java.lang.Thread.sleep() delay (≈990 ms), confirming the missing instrumentation.

The flame graph lists each method’s self‑time, allowing engineers to focus on the widest frames to locate performance bottlenecks.
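The difference between a method's self‑time and the width of its frame can be made concrete with collapsed stack samples. The sketch below is a minimal illustration, not ARMS code: the stack strings and sample counts are invented, and the "root;caller;leaf" format is an assumption about how samples are represented.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;

public class FlameGraphMath {
    /** Samples in which the method is the leaf frame, i.e. it was the code actually running. */
    public static Map<String, Integer> selfSamples(Map<String, Integer> collapsed) {
        Map<String, Integer> self = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : collapsed.entrySet()) {
            String[] frames = e.getKey().split(";");
            self.merge(frames[frames.length - 1], e.getValue(), Integer::sum);
        }
        return self;
    }

    /** Samples in which the method appears anywhere on the stack (combined width of its frames). */
    public static Map<String, Integer> totalSamples(Map<String, Integer> collapsed) {
        Map<String, Integer> total = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : collapsed.entrySet()) {
            // De-duplicate so a recursive method is counted once per sample.
            for (String frame : new LinkedHashSet<>(Arrays.asList(e.getKey().split(";")))) {
                total.merge(frame, e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Illustrative sample counts, not measured data.
        Map<String, Integer> collapsed = new LinkedHashMap<>();
        collapsed.put("demo;take1000ms;Thread.sleep", 99);
        collapsed.put("demo;executeQuery", 1);
        System.out.println("self:  " + selfSamples(collapsed));
        System.out.println("total: " + totalSamples(collapsed));
    }
}
```

In this made‑up data, demo has the widest frame (100 samples) but no self‑time, while Thread.sleep accounts for 99 self samples. Following the wide frames down to the method with high self‑time is what locates the bottleneck.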

Core Characteristics

Low Overhead: Automatic trace‑based sampling keeps CPU impact around 5% and memory usage modest.

Fine Granularity: Correlates trace identifiers with sampled stacks to expose call‑chain‑level hotspots.

Secure & Reliable: Addresses known Async Profiler risks, provides 7‑day data retention, and operates safely in production.

For further reading, see the Arthas trace command documentation, the Async Profiler project, and ARMS user guides.
