How Inferred Spans Boost Distributed Tracing Accuracy and Coverage
This article examines inferred spans, an observability technique that enriches traditional distributed tracing by automatically generating additional spans. The extra spans improve coverage, pinpoint latency sources, and suggest performance optimisations. The article also covers practical integration, algorithmic details, and the associated trade‑offs.
Introduction
In modern micro‑service architectures, traditional distributed tracing often captures only coarse‑grained request start and end times, leaving many performance problems undetected. Inferred spans are an observability technique that combines stack‑trace analysis with existing trace data to create new spans automatically, extending both the coverage and the precision of a trace.
Distributed Tracing Overview
Conventional tracing records explicit spans and context propagation across services, but it typically lacks detailed function‑call information, resulting in coarse‑grained visibility and missed latency sources.
Principle of Inferred Spans
Inferred spans are generated by fusing stack‑trace data collected via async‑profiler (which provides low‑overhead wall‑clock sampling) with instrumentation‑based trace data. An interleaving algorithm merges the two data streams, establishes parent‑child relationships, and produces additional spans named after the class and method they represent (e.g., Class#method).
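To make the idea concrete, here is a minimal sketch of how periodic stack samples could be folded into spans named Class#method. It is an illustration under stated assumptions, not the agent's actual implementation: the Sample and Span types, the infer method, and the frame names ApiServlet#doGet and ApiServlet#queryRedis are hypothetical, and a frame is treated as one span for as long as it appears at the same depth in consecutive samples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class InferredSpanSketch {

    /** One profiler sample: a timestamp plus the call stack, outermost frame first. */
    static final class Sample {
        final long timeMs;
        final List<String> frames;

        Sample(long timeMs, String... frames) {
            this.timeMs = timeMs;
            this.frames = Arrays.asList(frames);
        }
    }

    /** A span inferred from consecutive samples showing the same frame at the same depth. */
    static final class Span {
        final String name;
        final int depth;
        final long startMs;
        long endMs;

        Span(String name, int depth, long startMs) {
            this.name = name;
            this.depth = depth;
            this.startMs = startMs;
            this.endMs = startMs;
        }

        @Override
        public String toString() {
            return name + " depth=" + depth + " [" + startMs + ".." + endMs + " ms]";
        }
    }

    /** Folds samples (sorted by time, single thread) into inferred spans. */
    static List<Span> infer(List<Sample> samples) {
        List<Span> finished = new ArrayList<>();
        List<Span> open = new ArrayList<>(); // list index == stack depth
        for (Sample s : samples) {
            // Count the leading frames that match the spans that are still open.
            int common = 0;
            while (common < open.size() && common < s.frames.size()
                    && open.get(common).name.equals(s.frames.get(common))) {
                common++;
            }
            // Frames that vanished from the stack end at their last observed sample.
            for (int d = open.size() - 1; d >= common; d--) {
                finished.add(open.remove(d));
            }
            // Frames that are still present are extended to this sample.
            for (Span span : open) {
                span.endMs = s.timeMs;
            }
            // Frames seen for the first time open new spans.
            for (int d = common; d < s.frames.size(); d++) {
                open.add(new Span(s.frames.get(d), d, s.timeMs));
            }
        }
        // Close whatever is still open, deepest frame first.
        for (int d = open.size() - 1; d >= 0; d--) {
            finished.add(open.get(d));
        }
        return finished;
    }

    public static void main(String[] args) {
        List<Sample> samples = Arrays.asList(
                new Sample(0, "ApiServlet#doGet", "ApiServlet#handleDelay"),
                new Sample(50, "ApiServlet#doGet", "ApiServlet#handleDelay"),
                new Sample(100, "ApiServlet#doGet", "ApiServlet#handleDelay"),
                new Sample(150, "ApiServlet#doGet", "ApiServlet#queryRedis"));
        for (Span span : infer(samples)) {
            System.out.println(span);
        }
    }
}
```

Because a span can only be as precise as the samples that observed it, its start and end are approximations bounded by the sampling interval.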
Demo Application
A Java demo named inferred-demo exposes a queryOrder endpoint that calls Redis and includes an artificial delay. With the standard Java agent configuration, the trace shows only the endpoint span and the Redis calls. It reports the total request time (6.02 ms) but gives no insight into where most of the latency occurs.
Enabling Inferred Spans
Setting the option otel.inferred.spans.enabled to true in the APM agent’s configuration activates the inferred‑spans feature. The resulting trace includes automatically generated spans for internal methods such as ApiServlet#handleDelay, revealing that the artificial delay accounts for 4.22 s of the total 4.55 s request time.
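As an illustration, the sketch below assembles a JVM launch command that passes the option as a system property. Only otel.inferred.spans.enabled comes from the article. The jar file names are hypothetical placeholders, and the otel.javaagent.extensions entry assumes a setup in which inferred spans ship as a separate extension for the OpenTelemetry Java agent; distributions that bundle the feature would not need it.

```java
import java.util.Arrays;
import java.util.List;

final class InferredSpansLaunch {

    /** Builds the command line used to start the demo with inferred spans switched on. */
    static List<String> command(String agentJar, String extensionJar, String appJar) {
        return Arrays.asList(
                "java",
                "-javaagent:" + agentJar,                      // attach the tracing agent
                "-Dotel.javaagent.extensions=" + extensionJar, // assumption: feature loaded as an extension
                "-Dotel.inferred.spans.enabled=true",          // option named in the article
                "-jar", appJar);
    }

    public static void main(String[] args) {
        // All three file names are placeholders for illustration.
        List<String> cmd = command("opentelemetry-javaagent.jar",
                "inferred-spans-extension.jar", "inferred-demo.jar");
        System.out.println(String.join(" ", cmd));
    }
}
```

The same property can usually be supplied through whatever configuration mechanism the agent already uses; the command line is shown only because it makes the setting explicit.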
Data Collection and Interleaving Algorithm
Regular trace data: explicit spans recorded by the application.
Inferred span data: wall‑clock timings collected by async‑profiler.
The interleaving algorithm aligns timestamps from both sources, merges them into a unified span hierarchy, and records activation/deactivation timestamps and thread IDs for each generated span.
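One way to picture the parent‑child step is the sketch below. It assumes each explicit span leaves an activation record (thread ID, activation time, deactivation time) and attaches an inferred span to the innermost record on the same thread that fully contains it. The Activation and Inferred types, the parentOf method, and the span names are hypothetical; the real algorithm is more involved.

```java
import java.util.Arrays;
import java.util.List;

final class InterleaveSketch {

    /** Records when an explicit span was active on a thread. */
    static final class Activation {
        final String spanName;
        final long threadId;
        final long activatedMs;
        final long deactivatedMs;

        Activation(String spanName, long threadId, long activatedMs, long deactivatedMs) {
            this.spanName = spanName;
            this.threadId = threadId;
            this.activatedMs = activatedMs;
            this.deactivatedMs = deactivatedMs;
        }
    }

    /** A span derived from profiler samples on one thread. */
    static final class Inferred {
        final String name;
        final long threadId;
        final long startMs;
        final long endMs;

        Inferred(String name, long threadId, long startMs, long endMs) {
            this.name = name;
            this.threadId = threadId;
            this.startMs = startMs;
            this.endMs = endMs;
        }
    }

    /** Returns the shortest activation on the same thread that encloses the span, or null. */
    static Activation parentOf(Inferred span, List<Activation> activations) {
        Activation best = null;
        for (Activation a : activations) {
            boolean sameThread = a.threadId == span.threadId;
            boolean encloses = a.activatedMs <= span.startMs && span.endMs <= a.deactivatedMs;
            if (!sameThread || !encloses) {
                continue;
            }
            // Prefer the tightest enclosing window, i.e. the innermost explicit span.
            if (best == null
                    || a.deactivatedMs - a.activatedMs < best.deactivatedMs - best.activatedMs) {
                best = a;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Activation> activations = Arrays.asList(
                new Activation("GET /queryOrder", 1, 0, 200),
                new Activation("redis GET", 1, 160, 190));
        Inferred delay = new Inferred("ApiServlet#handleDelay", 1, 10, 150);
        Activation parent = parentOf(delay, activations);
        System.out.println(delay.name + " -> " + (parent == null ? "no parent" : parent.spanName));
    }
}
```

Matching on thread ID matters because wall‑clock samples from different threads overlap in time; without it, an inferred span could be attached to a request that merely ran concurrently.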
Technical Performance Challenges
The feature relies on async‑profiler. Its overhead is low but depends on the profiler's sampling interval and on the trace‑sampling rate. Longer sampling intervals reduce overhead but may miss short‑lived methods. Lowering the trace‑sampling rate to 50%, for example, roughly halves the analysis load while keeping useful visibility.
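The trade‑off can be estimated with some simple arithmetic. The sketch below assumes a method that stays on the stack for its whole duration and a sampling tick whose phase is uniformly random relative to the method start. Under those assumptions a method shorter than the interval is observed with probability duration / interval, and a longer method is guaranteed at least floor(duration / interval) samples. The 50 ms interval and the request rate are illustrative values, not figures from the article.

```java
final class SamplingTradeoff {

    /** Chance that at least one sampling tick lands inside the method's execution. */
    static double detectionProbability(double methodMs, double intervalMs) {
        return Math.min(1.0, methodMs / intervalMs);
    }

    /** Minimum number of ticks that must land inside the method, whatever the phase. */
    static long guaranteedSamples(long methodMs, long intervalMs) {
        return methodMs / intervalMs;
    }

    /** Requests per second whose traces are kept and therefore analysed. */
    static double analysedRequestsPerSecond(double requestsPerSecond, double traceSampleRate) {
        return requestsPerSecond * traceSampleRate;
    }

    public static void main(String[] args) {
        long intervalMs = 50; // illustrative sampling interval
        System.out.println("10 ms method seen with probability "
                + detectionProbability(10, intervalMs));
        System.out.println("4220 ms delay yields at least "
                + guaranteedSamples(4220, intervalMs) + " samples");
        System.out.println("1000 req/s at 50% trace sampling -> "
                + analysedRequestsPerSecond(1000, 0.5) + " analysed req/s");
    }
}
```

With these numbers, the 4.22 s artificial delay from the demo is impossible to miss, while a 10 ms method would show up in only about one request in five.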
Conclusion
Inferred spans significantly enhance observability for distributed applications, enabling developers and operators to locate root causes of latency more accurately, accelerate troubleshooting, and improve system stability. When properly tuned, the technique offers a cost‑effective solution with controllable performance impact.
AsiaInfo Technology: New Tech Exploration