Mastering Distributed Tracing: From Dapper to Zipkin and OpenTracing
This article explores the fundamentals of distributed tracing, detailing concepts from Google's Dapper paper, the architecture and data model of Zipkin, sampling mechanisms, data propagation, and OpenTracing standards, while providing code examples and practical insights for implementing tracing in microservice environments.
Origin
I have recently been researching and working with distributed tracing, and this article summarizes the key points.
What is Distributed Tracing
As distributed systems become more complex with microservices, distributed databases, and caches, locating problems across many services becomes difficult. Distributed tracing reconstructs a request's call chain, showing latency, target machines, and status for each service node.
Dapper
Industry tracing systems such as Twitter's Zipkin, Uber's Jaeger, Alibaba's Eagle Eye, and Meituan's Mtrace are all inspired by Google's Dapper paper, which defines concepts, data representation, instrumentation, propagation, collection, storage, and visualization for tracing in microservice architectures.
Trace, Span, Annotations
Dapper introduces the concepts of trace, span, and annotation. A trace (identified by a globally unique traceId) represents the entire request path. Spans form a parent‑child tree; each span is identified by spanId and parentId. Annotations are user‑defined events.
Each span represents one RPC call, and its position in the tree (the edge to its parent) is given by its spanId and parentId. A span has a client half and a server half, which together record four events: client-send (cs), server-receive (sr), server-send (ss), and client-receive (cr). Combining the client and server records yields the complete span.
Dapper also lets developers attach custom annotations to a span, either as timestamped messages or as key/value pairs. Zipkin v1 calls the key/value form binaryAnnotation, and Zipkin v2 calls it tags.
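To make the parent-child relationship concrete, here is a minimal sketch of rebuilding a trace tree from a flat list of spans. The Span class, the build_tree and render helpers, and the sample operation names are all hypothetical and exist only for illustration; they are not taken from Dapper or Zipkin.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str


def build_tree(spans):
    """Return the root span and a map from parent span ID to child spans."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children[span.parent_id].append(span)
    return root, children


def render(span, children, depth=0):
    """Flatten the tree into indented lines, depth-first."""
    lines = ["  " * depth + span.name]
    for child in children[span.span_id]:
        lines.extend(render(child, children, depth + 1))
    return lines


# Spans arrive in arbitrary order; only the IDs link them together.
spans = [
    Span("t1", "c", "b", "db.query"),
    Span("t1", "a", None, "GET /order"),
    Span("t1", "b", "a", "inventory.check"),
    Span("t1", "d", "a", "payment.charge"),
]
root, children = build_tree(spans)
tree_lines = render(root, children)
```

Because the tree is rebuilt purely from spanId and parentId, the collector never needs to know which service produced which span.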
Internal vs. External Data
Tracing relies on two kinds of data. External (out-of-band) data, such as the cs and ss events, is generated by each node and reported to storage. Internal (in-band) data, namely traceId, spanId, and parentId, must travel with the request across services so that the reported spans can be linked together.
Sampling
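To reduce overhead, Dapper samples traces rather than recording every request, and the decision applies to the whole trace rather than to individual spans. With adaptive sampling, the rate is driven by a target number of sampled traces per unit of time instead of a fixed probability. This caps the volume of reported spans on busy services while still exposing performance bottlenecks.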
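The idea of adaptive sampling can be sketched as follows. This is not Dapper's actual algorithm, which the paper does not spell out in code; the AdaptiveSampler class, the fixed-window adjustment, and the choice to derive the decision from the trace ID are all assumptions made for illustration.

```python
BUCKETS = 10_000  # resolution of the sampling threshold (assumed)


class AdaptiveSampler:
    """Hypothetical sampler that steers toward a target number of
    sampled traces per time window."""

    def __init__(self, target_per_window):
        self.target = target_per_window
        self.threshold = BUCKETS  # start by sampling everything
        self.seen = 0

    def should_sample(self, trace_id):
        # Deriving the decision from the trace ID makes it repeatable
        # for a given trace.
        self.seen += 1
        return trace_id % BUCKETS < self.threshold

    def end_window(self):
        # Pick the threshold that would have produced roughly `target`
        # samples at the traffic level just observed.
        if self.seen:
            self.threshold = min(BUCKETS, BUCKETS * self.target // self.seen)
        self.seen = 0


sampler = AdaptiveSampler(target_per_window=10)
first_window = sum(sampler.should_sample(i) for i in range(1000))
sampler.end_window()
second_window = sum(sampler.should_sample(i) for i in range(0, 10_000, 10))
```

In this sketch the first window samples all 1000 traces, after which the threshold drops so that the same traffic level yields about 10 sampled traces in the next window.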
Storage
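Collected span data is stored centrally. Dapper writes it to Google Bigtable, laying out each trace as a single row keyed by traceId, with one column per span. Bigtable's sparse table layout suits this well because a trace can contain any number of spans, and it lets collectors stay stateless while a whole trace can be read back with a single row lookup.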
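A rough in-memory analogy for this layout is a dictionary of dictionaries, with one outer entry per trace and one inner entry per span. The write_span and read_trace helpers below are hypothetical names, and a real Bigtable client looks nothing like this; the sketch only shows why writes need no coordination between collectors.

```python
from collections import defaultdict

# One row per trace; each span occupies its own column within that row.
table = defaultdict(dict)


def write_span(trace_id, span_id, payload):
    """Store a span without reading or locking the rest of the trace."""
    table[trace_id][span_id] = payload


def read_trace(trace_id):
    """Fetch every span of a trace with one row lookup."""
    return dict(table.get(trace_id, {}))


# Spans from different services can arrive in any order.
write_span("t1", "b", {"name": "inventory.check"})
write_span("t1", "a", {"name": "GET /order"})
write_span("t2", "x", {"name": "GET /cart"})
trace = read_trace("t1")
```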
Zipkin
Zipkin is an open‑source implementation of Dapper and a major reference for tracing systems.
Architecture
Zipkin consists of Reporter, Transport, Collector, Storage, API, and UI components.
The Reporter lives in each service, generating spans, propagating internal data, reporting external data, and handling sampling. Transport sends external data via HTTP or Kafka. Collector receives and stores spans. Storage adapters support in‑memory, MySQL, Cassandra, and Elasticsearch. API provides query and ingestion endpoints, and UI visualizes traces.
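To illustrate the Reporter-to-Collector hop over HTTP, the sketch below builds (but does not send) a request for the collector's POST /api/v2/spans endpoint using only the standard library. The collector address, the build_report_request helper, and the sample span values are assumptions; the JSON field names follow the Zipkin v2 span format.

```python
import json
import urllib.request

# Assumed address of a locally running Zipkin collector.
ZIPKIN_URL = "http://localhost:9411/api/v2/spans"


def build_report_request(spans, url=ZIPKIN_URL):
    """Serialize a batch of spans as the JSON list the collector expects."""
    body = json.dumps(spans).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


span = {
    "traceId": "463ac35c9f6413ad48485a3953bb6124",
    "id": "a2fb4a1d1a96d312",
    "kind": "CLIENT",
    "name": "get /orders",
    "timestamp": 1_700_000_000_000_000,  # microseconds since epoch
    "duration": 25_000,                  # microseconds
    "localEndpoint": {"serviceName": "server-1"},
    "tags": {"http.path": "/orders"},
}
request = build_report_request([span])
# urllib.request.urlopen(request) would actually send the batch; it is
# left out here so the sketch runs without a collector.
```

A real reporter would typically buffer spans and send them asynchronously so that reporting never blocks the request path.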
Data Model (Zipkin v2)
Key fields of a Span include:
trace_id // 16 or 32 lower-hex characters (64-bit or 128-bit ID)
id // span identifier
parent_id // parent span identifier (empty for root)
kind // CLIENT, SERVER, PRODUCER, CONSUMER
name // operation name
timestamp // microseconds since epoch
duration // span duration in microseconds (e.g., client-receive minus client-send for a CLIENT span)
local_endpoint // service name, IP, port
remote_endpoint // peer service info
annotations // list of timestamped events
tags // user‑defined key/value pairs
debug // force reporting regardless of sampling
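shared // true when this record contributes to a span started by another tracer (e.g., the server side of an RPC that reuses the client's span ID)
Internal Data and Sampling Mechanism
Zipkin propagates internal data using the B3 format, which carries TraceId, SpanId, ParentSpanId, and the sampling state. Services transmit these values via HTTP headers (e.g., X-B3-TraceId) or gRPC metadata.
The sampling state can be Defer, Deny, Accept, or Debug. Defer means no decision has been made yet and leaves it to a downstream service, Deny and Accept record an explicit decision, and Debug forces the trace to be reported.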
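The following sketch shows how a client might write B3 headers and how a server might read them back. The header names and their values (X-B3-Sampled of 1 or 0, X-B3-Flags of 1 for debug, absence meaning defer) follow the B3 propagation spec; the inject and extract helpers and the shape of the context dictionary are hypothetical. Real HTTP header lookup is case-insensitive, which a plain dict does not model.

```python
def inject(ctx, headers):
    """Copy the trace context into outgoing request headers."""
    headers["X-B3-TraceId"] = ctx["trace_id"]
    headers["X-B3-SpanId"] = ctx["span_id"]
    if ctx.get("parent_id"):
        headers["X-B3-ParentSpanId"] = ctx["parent_id"]
    state = ctx.get("sampling", "defer")
    if state == "debug":
        headers["X-B3-Flags"] = "1"      # debug implies an accept decision
    elif state == "accept":
        headers["X-B3-Sampled"] = "1"
    elif state == "deny":
        headers["X-B3-Sampled"] = "0"
    # "defer" sends neither header, leaving the decision downstream
    return headers


def extract(headers):
    """Rebuild the trace context from incoming request headers."""
    if headers.get("X-B3-Flags") == "1":
        sampling = "debug"
    else:
        sampled = headers.get("X-B3-Sampled")
        sampling = {"1": "accept", "0": "deny"}.get(sampled, "defer")
    return {
        "trace_id": headers.get("X-B3-TraceId"),
        "span_id": headers.get("X-B3-SpanId"),
        "parent_id": headers.get("X-B3-ParentSpanId"),
        "sampling": sampling,
    }


ctx = {
    "trace_id": "463ac35c9f6413ad48485a3953bb6124",
    "span_id": "a2fb4a1d1a96d312",
    "parent_id": None,
    "sampling": "accept",
}
outgoing = inject(ctx, {})
received = extract(outgoing)
```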
Instrumentation and Reporting Process
Example flow:
Server‑1 initiates a call to Server‑2, creates a root span (CLIENT), records traceId, spanId, empty parentId, and propagates these values.
Server‑2 receives the request, creates a matching SERVER span, records its own endpoint.
Server‑2 calls Server‑3, creating a child CLIENT span.
Server‑3 receives the request, creates a SERVER span.
Server‑3 replies, records duration, and reports its span.
Server‑2 records duration for the Server‑3 call and reports its span.
Server‑2 replies to Server‑1, records duration, and reports its span.
Server‑1 records duration for the Server‑2 call and reports its span.
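In total, four partial spans are reported: a CLIENT record and a SERVER record for each of the two calls. Zipkin merges the records that share a span ID, leaving two stored spans.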
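The flow above can be sketched as a small simulation in which the four reported records are grouped by trace ID and span ID. The merge function, the record shape, and the durations are invented for illustration; this mirrors the merging described in this article rather than Zipkin's actual storage code.

```python
# The four records, in the order the flow above reports them.
reported = [
    {"traceId": "t1", "id": "b", "parentId": "a", "kind": "SERVER",
     "service": "server-3", "duration": 10_000},
    {"traceId": "t1", "id": "b", "parentId": "a", "kind": "CLIENT",
     "service": "server-2", "duration": 12_000},
    {"traceId": "t1", "id": "a", "parentId": None, "kind": "SERVER",
     "service": "server-2", "duration": 20_000},
    {"traceId": "t1", "id": "a", "parentId": None, "kind": "CLIENT",
     "service": "server-1", "duration": 23_000},
]


def merge(records):
    """Combine client and server records that share a trace and span ID."""
    merged = {}
    for record in records:
        key = (record["traceId"], record["id"])
        span = merged.setdefault(key, {
            "traceId": record["traceId"],
            "id": record["id"],
            "parentId": record["parentId"],
            "sides": {},
        })
        span["sides"][record["kind"]] = {
            "service": record["service"],
            "duration": record["duration"],
        }
    return list(merged.values())


stored = merge(reported)
# The gap between the client-side and server-side durations approximates
# the network and queueing time for each call.
overhead = {
    span["id"]: span["sides"]["CLIENT"]["duration"]
    - span["sides"]["SERVER"]["duration"]
    for span in stored
}
```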
OpenTracing
OpenTracing provides a vendor‑agnostic API that allows developers to instrument code once and switch tracing implementations (e.g., Zipkin) without code changes.
Adapting Zipkin to OpenTracing requires writing a thin client wrapper.
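The shape of such a wrapper can be sketched with a simplified stand-in for the tracing interface. The method names start_span, set_tag, and finish loosely follow the OpenTracing API, but the Tracer and Span base classes here are hypothetical and are not the real opentracing package; the in-memory adapter, its sequential ID generator, and the Zipkin-like record it produces are likewise assumptions.

```python
import time
from abc import ABC, abstractmethod


class Span(ABC):
    @abstractmethod
    def set_tag(self, key, value): ...

    @abstractmethod
    def finish(self): ...


class Tracer(ABC):
    @abstractmethod
    def start_span(self, operation_name, child_of=None): ...


class InMemoryZipkinSpan(Span):
    def __init__(self, tracer, name, trace_id, span_id, parent_id):
        self._tracer = tracer
        self._start = time.time()
        self.record = {"traceId": trace_id, "id": span_id,
                       "parentId": parent_id, "name": name, "tags": {}}

    def set_tag(self, key, value):
        self.record["tags"][key] = str(value)  # tag values stored as strings
        return self

    def finish(self):
        elapsed = time.time() - self._start
        self.record["duration"] = int(elapsed * 1_000_000)  # microseconds
        self._tracer.finished.append(self.record)


class InMemoryZipkinTracer(Tracer):
    """Adapter that collects Zipkin-like records instead of sending them."""

    def __init__(self):
        self.finished = []
        self._next_id = 0

    def start_span(self, operation_name, child_of=None):
        self._next_id += 1
        span_id = format(self._next_id, "016x")
        trace_id = child_of.record["traceId"] if child_of else span_id
        parent_id = child_of.record["id"] if child_of else None
        return InMemoryZipkinSpan(self, operation_name, trace_id,
                                  span_id, parent_id)


def handle_request(tracer: Tracer):
    # Application code depends only on the abstract interface.
    parent = tracer.start_span("handle_request")
    child = tracer.start_span("load_user", child_of=parent)
    child.set_tag("user.id", 42)
    child.finish()
    parent.finish()


tracer = InMemoryZipkinTracer()
handle_request(tracer)
```

Since handle_request only sees the abstract Tracer, swapping in a different backend means supplying another adapter, with no change to the instrumented code.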
References
Zipkin – https://zipkin.io
Dapper – https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36356.pdf
Jaeger – https://www.jaegertracing.io/
Eagle Eye – https://cn.aliyun.com/aliware/news/monitoringsolution
Mtrace – https://tech.meituan.com/mt_mtrace.html
Zipkin‑b3‑propagation – https://github.com/openzipkin/b3-propagation
Zipkin‑api – https://zipkin.io/zipkin-api/#/default/post_spans
Zipkin‑proto – https://github.com/openzipkin/zipkin-api/blob/master/zipkin.proto
OpenTracing – https://opentracing.io
OpenTracing Chinese Docs – https://wu-sheng.gitbooks.io/opentracing-io/content/
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
