
Mastering Distributed Tracing: From Dapper to Zipkin and OpenTracing

This article explores the fundamentals of distributed tracing, detailing concepts from Google's Dapper paper, the architecture and data model of Zipkin, sampling mechanisms, data propagation, and OpenTracing standards, while providing code examples and practical insights for implementing tracing in microservice environments.

Programmer DD

Origin

I have recently been researching and working with distributed tracing, and this article summarizes the key points.

What is Distributed Tracing

As distributed systems grow more complex, with microservices, distributed databases, and caches, locating a problem across many services becomes difficult. Distributed tracing reconstructs the call chain of a request and shows the latency, target machine, and status of each service node along it.

Dapper

Industry tracing systems such as Twitter's Zipkin, Uber's Jaeger, Alibaba's Eagle Eye, and Meituan's Mtrace are all inspired by Google's Dapper paper, which defines concepts, data representation, instrumentation, propagation, collection, storage, and visualization for tracing in microservice architectures.

Trace, Span, Annotations

Dapper introduces the concepts of trace, span, and annotation. A trace, identified by a globally unique traceId, represents the entire path of one request. The spans of a trace form a parent‑child tree: each span carries its own spanId and the parentId of the span that caused it, and the root span has no parentId. Annotations are application‑defined events attached to a span.

Each span represents one RPC call, and the edge between two spans is expressed by the child's parentId. A span has a client side and a server side, which together record four events: client‑send (cs), server‑receive (sr), server‑send (ss), and client‑receive (cr). Combining the client and server information yields the complete span.

Dapper also lets applications attach custom annotations, including key‑value pairs. Zipkin v1 calls these key‑value annotations binaryAnnotations; Zipkin v2 calls them tags.
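The parent‑child relationship described above can be sketched in a few lines of Python. This is an illustration only: the `Span` class, `build_tree`, `render`, and the sample IDs and service names are hypothetical and do not come from Dapper or Zipkin.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    children: List["Span"] = field(default_factory=list)


def build_tree(spans: List[Span]) -> Span:
    """Link the spans of one trace by parent_id and return the root span."""
    by_id: Dict[str, Span] = {s.span_id: s for s in spans}
    root: Optional[Span] = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, depth: int = 0) -> List[str]:
    """Return one indented line per span, depth-first."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical trace: frontend calls auth and orders; orders calls db.
spans = [
    Span("t1", "a", None, "frontend"),
    Span("t1", "b", "a", "auth"),
    Span("t1", "c", "a", "orders"),
    Span("t1", "d", "c", "db"),
]
tree = build_tree(spans)
print("\n".join(render(tree)))
```

Because every span carries only its own ID and its parent's ID, the tree can be rebuilt from spans that arrive in any order and from any machine.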

Internal vs. External Data

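Tracing relies on two kinds of data. External (out‑of‑band) data, such as the cs and ss events, is generated on each node and reported to storage separately from the request. Internal (in‑band) data, namely traceId, spanId, and parentId, travels with the request from service to service so that the spans can be linked together later.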

Sampling

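To reduce overhead, Dapper samples traces rather than recording every request. The paper also describes adaptive sampling, which is parameterized by a desired rate of sampled traces per unit of time instead of a fixed probability: low‑traffic workloads are sampled more heavily and high‑traffic workloads less, which limits the volume of reported spans while still exposing performance bottlenecks.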
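A minimal sketch of rate‑targeted sampling follows. It is not Dapper's actual algorithm; the function names and the numbers are assumptions chosen to show the idea that the probability falls as traffic rises and that the decision is taken once, when the trace starts.

```python
import random


def adaptive_probability(target_per_sec: float, observed_per_sec: float) -> float:
    """Probability that keeps roughly target_per_sec traces at the observed load."""
    if observed_per_sec <= 0:
        return 1.0
    return min(1.0, target_per_sec / observed_per_sec)


def decide_at_root(probability: float, rng: random.Random) -> bool:
    """Make the sampling decision once, at the root of the trace."""
    return rng.random() < probability


# A busy service is sampled sparsely, a quiet one fully (hypothetical rates).
busy = adaptive_probability(target_per_sec=10, observed_per_sec=1000)
quiet = adaptive_probability(target_per_sec=10, observed_per_sec=5)
print("busy:", busy, "quiet:", quiet)

rng = random.Random(42)
kept = sum(decide_at_root(busy, rng) for _ in range(10_000))
print("kept", kept, "of 10000 traces")
```

The decision is then carried along with the request, so every downstream service either reports its part of the trace or skips it consistently.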

Storage

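Collected span data is stored centrally. Dapper uses Google Bigtable: each trace is laid out as a single row keyed by traceId, with one column per span. Bigtable's sparse table layout suits this because traces can contain an arbitrary number of spans, and it keeps collection stateless and queries simple, since reading a trace is a single row lookup.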
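The storage layout can be pictured with an in‑memory stand‑in: one row per trace, one cell per span. The `TraceStore` class below is a hypothetical sketch and says nothing about how Bigtable itself is implemented.

```python
from collections import defaultdict
from typing import Dict


class TraceStore:
    """Toy sparse table: row key = trace ID, column key = span ID."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, dict]] = defaultdict(dict)

    def write_span(self, trace_id: str, span_id: str, data: dict) -> None:
        # Each write touches a single cell, so collectors need no shared state.
        self._rows[trace_id][span_id] = data

    def read_trace(self, trace_id: str) -> Dict[str, dict]:
        # Reading a whole trace is one row lookup.
        return dict(self._rows.get(trace_id, {}))


store = TraceStore()
store.write_span("t1", "b", {"name": "auth"})      # spans may arrive in any order
store.write_span("t1", "a", {"name": "frontend"})
store.write_span("t2", "x", {"name": "batch-job"})
print(store.read_trace("t1"))
```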

Zipkin

Zipkin is an open‑source tracing system from Twitter, modeled on Dapper, and a major reference for other tracing systems.

Architecture

Zipkin consists of Reporter, Transport, Collector, Storage, API, and UI components.

The Reporter lives in each service, generating spans, propagating internal data, reporting external data, and handling sampling. Transport sends external data via HTTP or Kafka. Collector receives and stores spans. Storage adapters support in‑memory, MySQL, Cassandra, and Elasticsearch. API provides query and ingestion endpoints, and UI visualizes traces.
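To make the Reporter‑to‑Collector hop concrete, the sketch below builds (but does not send) an HTTP request carrying a JSON list of spans to the Zipkin v2 endpoint `POST /api/v2/spans`. The collector address `http://localhost:9411`, the helper name `build_report_request`, and the span values are assumptions for illustration.

```python
import json
import urllib.request
from typing import List


def build_report_request(
    spans: List[dict], collector: str = "http://localhost:9411"
) -> urllib.request.Request:
    """Build the HTTP request a reporter could send to the collector."""
    body = json.dumps(spans).encode("utf-8")
    return urllib.request.Request(
        url=collector + "/api/v2/spans",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical span; a real reporter would batch many of these.
example_span = {
    "traceId": "463ac35c9f6413ad48485a3953bb6124",
    "id": "a2fb4a1d1a96d312",
    "kind": "CLIENT",
    "name": "get /orders",
    "timestamp": 1700000000000000,
    "duration": 1500,
    "localEndpoint": {"serviceName": "server-1"},
}
request = build_report_request([example_span])
print(request.get_method(), request.full_url)
# Sending is left out on purpose: urllib.request.urlopen(request) would need
# a running collector.
```

Real reporters typically queue spans and flush them in batches on a background thread so that reporting never blocks the request path.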

Data Model (Zipkin v2)

Key fields of a Span include:

trace_id   // 16 or 32 lower‑hex characters in JSON (a 64‑ or 128‑bit ID)
id         // span identifier, 16 lower‑hex characters
parent_id  // parent span identifier (absent for the root span)
kind       // CLIENT, SERVER, PRODUCER, CONSUMER
name       // operation name
timestamp  // start time, microseconds since epoch
duration   // span duration in microseconds (for a CLIENT span, client‑receive minus client‑send)
local_endpoint   // service name, IP, port
remote_endpoint  // peer service info
annotations // list of timestamped events
tags       // user‑defined key/value pairs
debug      // force reporting regardless of sampling
shared     // true when the span was started by another tracer (e.g., a SERVER span reusing the caller's span ID)
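As an illustration of these fields, the sketch below assembles one CLIENT span in the camelCase form used by Zipkin's v2 JSON encoding (`traceId`, `parentId`, `localEndpoint`, and so on). The helper names, service names, address, and tag are made up for the example.

```python
import json
import secrets
import time
from typing import Optional


def new_id(bits: int = 64) -> str:
    """Random lower-hex ID: 16 characters for 64 bits, 32 for 128 bits."""
    return secrets.token_hex(bits // 8)


def client_span(name: str, trace_id: str, parent_id: Optional[str],
                start_us: int, end_us: int) -> dict:
    span = {
        "traceId": trace_id,
        "id": new_id(64),
        "kind": "CLIENT",
        "name": name,
        "timestamp": start_us,            # microseconds since epoch
        "duration": end_us - start_us,    # client-receive minus client-send
        "localEndpoint": {"serviceName": "server-1", "ipv4": "10.0.0.1", "port": 8080},
        "remoteEndpoint": {"serviceName": "server-2"},
        "tags": {"http.method": "GET"},
    }
    if parent_id is not None:             # the root span omits parentId
        span["parentId"] = parent_id
    return span


start = int(time.time() * 1_000_000)
span = client_span("get /orders", new_id(128), None, start, start + 1500)
print(json.dumps(span, indent=2))
```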

Internal Data and Sampling Mechanism

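Zipkin propagates internal (in‑band) data using the B3 format, which carries TraceId, SpanId, ParentSpanId, and Sampled, plus an optional debug flag. Services transmit these values as HTTP headers (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, X-B3-Sampled, X-B3-Flags) or as gRPC metadata.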
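A minimal sketch of B3 propagation over HTTP headers: the caller injects the context into outgoing headers and the callee extracts it. The header names are the ones B3 defines; `TraceContext`, `inject`, and `extract` are hypothetical helper names, and the handling of `X-B3-Sampled` is simplified to the values "1" and "0".

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class TraceContext:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None
    sampled: Optional[bool] = None  # None means no decision has been made yet


def inject(ctx: TraceContext) -> Dict[str, str]:
    """Write the context into outgoing request headers."""
    headers = {"X-B3-TraceId": ctx.trace_id, "X-B3-SpanId": ctx.span_id}
    if ctx.parent_span_id is not None:
        headers["X-B3-ParentSpanId"] = ctx.parent_span_id
    if ctx.sampled is not None:
        headers["X-B3-Sampled"] = "1" if ctx.sampled else "0"
    return headers


def extract(headers: Mapping[str, str]) -> Optional[TraceContext]:
    """Read the context from incoming headers (names are case-insensitive)."""
    lowered = {k.lower(): v for k, v in headers.items()}
    trace_id = lowered.get("x-b3-traceid")
    span_id = lowered.get("x-b3-spanid")
    if trace_id is None or span_id is None:
        return None  # no trace in progress; the receiver starts a new one
    sampled_raw = lowered.get("x-b3-sampled")
    sampled = None if sampled_raw is None else sampled_raw == "1"
    return TraceContext(trace_id, span_id, lowered.get("x-b3-parentspanid"), sampled)


sent = TraceContext("463ac35c9f6413ad48485a3953bb6124", "a2fb4a1d1a96d312", None, True)
outgoing = inject(sent)
incoming = extract(outgoing)
print(outgoing)
print(incoming)
```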

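The sampling state is one of Defer, Deny, Accept, or Debug. Defer means no decision has been made yet (the Sampled value is absent), so the receiver decides; Deny (Sampled = 0) means the span is not reported; Accept (Sampled = 1) means it is reported; Debug forces reporting regardless of the sampling rate.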
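The four states can be derived from incoming B3 headers roughly as follows. The enum and function names are hypothetical; the mapping (debug flag set, Sampled absent, Sampled 1 or 0) follows the B3 propagation description, and accepting the legacy value "true" is a convenience in this sketch.

```python
from enum import Enum
from typing import Mapping


class SamplingState(Enum):
    DEFER = "defer"    # no decision yet, the receiver decides
    DENY = "deny"      # do not report
    ACCEPT = "accept"  # report
    DEBUG = "debug"    # always report, bypassing the sampling rate


def sampling_state(headers: Mapping[str, str]) -> SamplingState:
    lowered = {k.lower(): v for k, v in headers.items()}
    if lowered.get("x-b3-flags") == "1":
        return SamplingState.DEBUG  # debug implies an accept decision
    sampled = lowered.get("x-b3-sampled")
    if sampled is None:
        return SamplingState.DEFER
    # Some older tracers sent "true"/"false" instead of "1"/"0".
    return SamplingState.ACCEPT if sampled in ("1", "true") else SamplingState.DENY


for hdrs in ({}, {"X-B3-Sampled": "0"}, {"X-B3-Sampled": "1"}, {"X-B3-Flags": "1"}):
    print(hdrs, "->", sampling_state(hdrs).name)
```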

Instrumentation and Reporting Process

Example flow:

Server‑1 initiates a call to Server‑2, creates a root span (CLIENT), records traceId, spanId, empty parentId, and propagates these values.

Server‑2 receives the request, creates a matching SERVER span, records its own endpoint.

Server‑2 calls Server‑3, creating a child CLIENT span.

Server‑3 receives the request, creates a SERVER span.

Server‑3 replies, records duration, and reports its span.

Server‑2 records duration for the Server‑3 call and reports its span.

Server‑2 replies to Server‑1, records duration, and reports its span.

Server‑1 records duration for the Server‑2 call and reports its span.

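In total, four span records are reported: a CLIENT half and a SERVER half for each of the two RPCs. Because the two halves of an RPC share the same span ID, Zipkin merges them into two spans, one per call.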
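The flow above can be replayed in a short sketch: four reported records are grouped by (traceId, id) into two spans. The IDs, service names, and durations are invented, and the grouping function is an illustration of the idea, not Zipkin's own merge code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Records in the order they are reported in the flow (durations in microseconds).
reports: List[dict] = [
    {"traceId": "t1", "id": "s2", "parentId": "s1", "kind": "SERVER", "service": "server-3", "duration": 20},
    {"traceId": "t1", "id": "s2", "parentId": "s1", "kind": "CLIENT", "service": "server-2", "duration": 30},
    {"traceId": "t1", "id": "s1", "kind": "SERVER", "service": "server-2", "duration": 50},
    {"traceId": "t1", "id": "s1", "kind": "CLIENT", "service": "server-1", "duration": 60},
]


def merge(records: List[dict]) -> Dict[Tuple[str, str], Dict[str, dict]]:
    """Group the client and server halves that share a trace ID and span ID."""
    merged: Dict[Tuple[str, str], Dict[str, dict]] = defaultdict(dict)
    for record in records:
        merged[(record["traceId"], record["id"])][record["kind"]] = record
    return dict(merged)


merged_spans = merge(reports)
for (trace_id, span_id), halves in merged_spans.items():
    # Client duration minus server duration approximates time spent in transit.
    transit = halves["CLIENT"]["duration"] - halves["SERVER"]["duration"]
    print(span_id, sorted(halves), "transit time:", transit)
```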

OpenTracing

OpenTracing provides a vendor‑agnostic API: developers instrument their code once against the API and can then switch the tracing implementation behind it (e.g., Zipkin) without changing the instrumentation.

Adapting Zipkin to OpenTracing requires a thin client wrapper that implements the OpenTracing interfaces on top of the Zipkin client.
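The value of a vendor‑neutral API can be shown with a small stand‑in. The interfaces below are hypothetical and much simpler than the real OpenTracing API; the method names `start_span`, `set_tag`, and `finish` are borrowed from it only to keep the sketch recognizable. The point is that `handle_request` never changes when the tracer behind it does.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Span(ABC):
    @abstractmethod
    def set_tag(self, key: str, value: str) -> None: ...

    @abstractmethod
    def finish(self) -> None: ...


class Tracer(ABC):
    """Hypothetical vendor-neutral interface that application code depends on."""

    @abstractmethod
    def start_span(self, operation_name: str) -> Span: ...


class NoopSpan(Span):
    def set_tag(self, key: str, value: str) -> None:
        pass

    def finish(self) -> None:
        pass


class NoopTracer(Tracer):
    def start_span(self, operation_name: str) -> Span:
        return NoopSpan()


class RecordingSpan(Span):
    def __init__(self, name: str, sink: List[dict]) -> None:
        self.name = name
        self.tags: Dict[str, str] = {}
        self._sink = sink

    def set_tag(self, key: str, value: str) -> None:
        self.tags[key] = value

    def finish(self) -> None:
        # A Zipkin-backed wrapper would hand the span to its reporter here.
        self._sink.append({"name": self.name, "tags": dict(self.tags)})


class RecordingTracer(Tracer):
    def __init__(self) -> None:
        self.finished: List[dict] = []

    def start_span(self, operation_name: str) -> Span:
        return RecordingSpan(operation_name, self.finished)


def handle_request(tracer: Tracer) -> str:
    """Application code: instrumented once, unaware of the backend."""
    span = tracer.start_span("get-order")
    span.set_tag("http.method", "GET")
    try:
        return "ok"
    finally:
        span.finish()


recording = RecordingTracer()
print(handle_request(NoopTracer()), handle_request(recording), recording.finished)
```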

References

Zipkin – https://zipkin.io

Dapper – https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36356.pdf

Jaeger – https://www.jaegertracing.io/

Eagle Eye – https://cn.aliyun.com/aliware/news/monitoringsolution

Mtrace – https://tech.meituan.com/mt_mtrace.html

Zipkin‑b3‑propagation – https://github.com/openzipkin/b3-propagation

Zipkin‑api – https://zipkin.io/zipkin-api/#/default/post_spans

Zipkin‑proto – https://github.com/openzipkin/zipkin-api/blob/master/zipkin.proto

OpenTracing – https://opentracing.io

OpenTracing Chinese Docs – https://wu-sheng.gitbooks.io/opentracing-io/content/

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
