Understanding Distributed Tracing and Its Use at Liulishuo
This article explains what distributed tracing is, why it is needed alongside logging and metrics for observability, how it works with trace and span IDs, and describes Liulishuo's implementation using OpenTelemetry, W3C Trace Context, and tail‑based sampling to improve backend debugging.
1. What is Distributed Tracing?
According to the OpenTracing definition, distributed tracing (also called distributed request tracing) is a method used to profile and monitor applications, especially those built with a micro‑services architecture, helping pinpoint failures and performance problems.
In simple terms, it is a technique for troubleshooting application issues, particularly in distributed systems.
2. Why Do We Need Distributed Tracing?
Developers often rely on logging and metrics alone. Distributed tracing complements them and improves observability, that is, the ability to answer questions about how a system behaves at runtime. The more observable a system is, the more operational questions it can answer.
Examples of questions:
If a request is slow, where is the bottleneck?
If a request fails, where did the error occur?
When an error happens, which component (and which team) is responsible?
Logging, metrics, and distributed tracing are the three pillars of observability; the following sections compare them.
2.1 Metrics
Metrics can tell you that something bad happened (e.g., high error rate or resource usage) but cannot explain why or how to fix it. Metrics aggregate data and lack request‑level context, making it hard to trace errors for individual requests.
Metrics are cheap to collect and store, so they need no sampling and the resulting numbers are accurate. However, they offer limited insight when debugging depends on the context of a specific request.
2.2 Logging
Logs provide detailed runtime information and can include a unique request ID, allowing extraction of a single request’s context. However, in distributed and highly concurrent environments, correlating logs across services becomes difficult.
Developers need additional tools to query logs across all services and filter them by request ID. Even then, they may struggle to reconstruct the execution path, because logs are a flat sequence of lines with no explicit parent-child relationships between operations.
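The request-ID approach described above can be sketched with Python's standard library. This is a minimal illustration, not Liulishuo's logging setup: the names current_request_id, RequestIdFilter, build_logger, handle_request and lines_for_request, and the sample messages, are all hypothetical.

```python
import io
import logging
from contextvars import ContextVar

# Holds the ID of the request currently being handled (hypothetical name).
current_request_id = ContextVar("current_request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = current_request_id.get()
        return True


def build_logger(stream: io.StringIO) -> logging.Logger:
    """Create a logger whose lines start with the request ID."""
    logger = logging.getLogger("demo.request_id")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(request_id)s %(name)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger, request_id: str, user: str) -> None:
    """Simulate one request; every line it logs carries request_id."""
    token = current_request_id.set(request_id)
    try:
        logger.info("received request for %s", user)
        logger.info("finished request for %s", user)
    finally:
        current_request_id.reset(token)


def lines_for_request(log_text: str, request_id: str) -> list:
    """Extract the log lines that belong to a single request."""
    prefix = request_id + " "
    return [line for line in log_text.splitlines() if line.startswith(prefix)]


if __name__ == "__main__":
    stream = io.StringIO()
    log = build_logger(stream)
    handle_request(log, "req-1", "alice")
    handle_request(log, "req-2", "bob")
    for line in lines_for_request(stream.getvalue(), "req-1"):
        print(line)
```

Filtering by ID recovers the lines of one request, but the result is still a flat list: nothing in it says which operation called which, which is the gap tracing fills.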
2.3 Distributed Tracing
To solve the missing-context and missing-relationship problems, tracing assigns a globally unique TraceID to each request and passes it along to every service the request touches, a mechanism known as metadata propagation. Each operation is recorded as a Span with its own SpanID and the ParentID of the span that triggered it, so the spans of one trace form a tree.
Spans contain timestamps, status, and additional metadata, allowing calculation of latency between operations and identification of error locations.
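A minimal sketch of this data model, assuming 16-byte trace IDs and 8-byte span IDs written as hex (the sizes used by W3C Trace Context). The Span fields, the helper names make_span and render_tree, and the operation names are hypothetical and chosen only for illustration.

```python
import secrets
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str              # unique per operation
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: int
    end_ms: int
    status: str = "OK"

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def make_span(name: str, start_ms: int, end_ms: int,
              parent: Optional[Span] = None) -> Span:
    """Create a span; a child inherits the trace ID of its parent."""
    return Span(
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_id=parent.span_id if parent else None,
        name=name,
        start_ms=start_ms,
        end_ms=end_ms,
    )


def render_tree(spans: list) -> list:
    """Rebuild the span tree from ParentIDs and return indented lines."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    def walk(parent_id: Optional[str], depth: int) -> list:
        lines = []
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append("  " * depth + f"{span.name} ({span.duration_ms} ms)")
            lines.extend(walk(span.span_id, depth + 1))
        return lines

    return walk(None, 0)


if __name__ == "__main__":
    root = make_span("GET /checkout", 0, 120)
    auth = make_span("auth.verify", 5, 20, parent=root)
    pay = make_span("payment.charge", 25, 110, parent=root)
    db = make_span("db.insert", 30, 50, parent=pay)
    # Spans usually arrive out of order; the ParentID restores the structure.
    print("\n".join(render_tree([db, pay, auth, root])))
```

The point of the design is that the tree can be rebuilt from ParentIDs alone, in whatever order the spans were reported.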
With this data, the three earlier questions can be answered:
Identify the slowest Span to locate bottlenecks.
Inspect Spans that contain errors to find failure points.
Use Span metadata to determine which service is responsible.
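The three answers can be sketched as small queries over the spans of one trace. This is an illustration under stated assumptions: the Span fields, function names, and sample data are hypothetical; child spans are assumed to run one after another, so a span's own time is its duration minus the durations of its direct children. Own time is used rather than raw duration because the root span always has the longest total duration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    name: str
    start_ms: int
    end_ms: int
    error: bool = False


def self_time_ms(span: Span, spans: list) -> int:
    """Time spent in the span itself, excluding its direct children."""
    child_total = sum(c.end_ms - c.start_ms
                      for c in spans if c.parent_id == span.span_id)
    return (span.end_ms - span.start_ms) - child_total


def bottleneck(spans: list) -> Span:
    """Question 1: where is the request slow?"""
    return max(spans, key=lambda s: self_time_ms(s, spans))


def error_origins(spans: list) -> list:
    """Question 2: failed spans with no failed child, i.e. where errors start."""
    failed_parents = {s.parent_id for s in spans if s.error}
    return [s for s in spans if s.error and s.span_id not in failed_parents]


def responsible_services(spans: list) -> list:
    """Question 3: which services own the spans where errors started?"""
    return sorted({s.service for s in error_origins(spans)})


SAMPLE_TRACE = [
    Span("a", None, "gateway", "GET /checkout", 0, 120, error=True),
    Span("b", "a", "auth", "auth.verify", 5, 20),
    Span("c", "a", "payment", "payment.charge", 25, 110, error=True),
    Span("d", "c", "db", "db.insert", 30, 50),
]

if __name__ == "__main__":
    print("bottleneck:", bottleneck(SAMPLE_TRACE).name)
    print("error origin:", [s.name for s in error_origins(SAMPLE_TRACE)])
    print("responsible:", responsible_services(SAMPLE_TRACE))
```

In the sample trace the gateway span is marked as failed only because the payment call beneath it failed, so the origin check points at the payment service rather than at the gateway.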
3. Application at Liulishuo
Liulishuo uses OpenTelemetry as its tracing SDK and adopts the W3C Trace Context standard. To ensure compatibility with third‑party services, the SDK also supports the B3 specification.
W3C Trace Context defines standard HTTP headers for propagating trace metadata, solving format‑compatibility issues.
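As an illustration, the main header defined by the standard is traceparent, whose version 00 value has four dash-separated lowercase hex fields: version, trace-id (32 characters), parent-id (16 characters), and trace-flags (2 characters, where bit 0x01 means "sampled"). The sketch below handles version 00 only and ignores the companion tracestate header; the helper names parse_traceparent and child_headers are hypothetical, and in practice the OpenTelemetry SDK does this work.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
_TRACEPARENT_V00 = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str   # span ID of the caller
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent value; return None if it is invalid."""
    match = _TRACEPARENT_V00.fullmatch(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_headers(incoming: TraceContext) -> dict:
    """Headers for an outgoing call: same trace ID, a new span ID as parent-id."""
    span_id = secrets.token_hex(8)
    flags = "01" if incoming.sampled else "00"
    return {"traceparent": f"00-{incoming.trace_id}-{span_id}-{flags}"}


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ctx = parse_traceparent(header)
    print(ctx)
    print(child_headers(ctx))
```

Because every service reads and writes the same header layout, a trace stays intact even when the services use different tracing libraries.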
OpenTelemetry, formed by merging OpenCensus and OpenTracing, aims to support all three observability pillars. At the time of writing, only tracing was production-ready, but its language-agnostic specification gives developers a consistent experience across services.
Tracing can generate large volumes of data, so some form of sampling is needed. Liulishuo disables head-based sampling, where the keep-or-drop decision is made when a request starts, and uses tail-based sampling instead. All traces are collected first, then a processing service decides which ones to retain (for example, those with errors or high latency) based on configurable rules. This reduces storage costs while preserving the most valuable data.
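The retention decision can be sketched as follows. This is a simplified illustration, not Liulishuo's processing service: the Span and Rules types, the 500 ms default threshold, and the function names are assumptions, and the sketch treats each trace as already complete when the decision is made.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    trace_id: str
    name: str
    duration_ms: int
    error: bool = False


@dataclass(frozen=True)
class Rules:
    """Configurable retention rules (hypothetical defaults)."""
    keep_errors: bool = True
    latency_threshold_ms: int = 500


def group_by_trace(spans: list) -> dict:
    """Collect all spans of each trace before any decision is made."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return dict(traces)


def should_keep(trace_spans: list, rules: Rules) -> bool:
    """Decide with full knowledge of the trace: keep errors and slow requests."""
    if rules.keep_errors and any(s.error for s in trace_spans):
        return True
    return max(s.duration_ms for s in trace_spans) >= rules.latency_threshold_ms


def tail_sample(spans: list, rules: Rules) -> dict:
    """Return only the traces that match the retention rules."""
    return {trace_id: trace_spans
            for trace_id, trace_spans in group_by_trace(spans).items()
            if should_keep(trace_spans, rules)}


if __name__ == "__main__":
    collected = [
        Span("t-fast", "GET /home", 40),
        Span("t-slow", "GET /report", 900),
        Span("t-slow", "db.query", 850),
        Span("t-err", "POST /pay", 120, error=True),
    ]
    print(sorted(tail_sample(collected, Rules())))
```

The trade-off is that every span must be transported and held until its trace is complete, so the savings come in long-term storage rather than in collection.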
4. References
https://opentelemetry.io
https://opentracing.io/docs/overview/what-is-tracing
https://www.w3.org/TR/trace-context
https://github.com/open-telemetry/opentelemetry-specification
